Methods and systems for genomic analysis

ABSTRACT

A computer-implemented method for processing and/or analyzing nucleic acid sequencing data comprises receiving a first data input and a second data input. The first data input comprises untargeted sequencing data generated from a first nucleic acid sample obtained from a subject. The second data input comprises target-specific sequencing data generated from a second nucleic acid sample obtained from the subject. Next, with the aid of a computer processor, the first data input and the second data input are combined to produce a combined data set. Next, an output derived from the combined data set is generated. The output is indicative of the presence or absence of one or more polymorphisms of the first nucleic acid sample and/or the second nucleic acid sample.

CROSS-REFERENCE

This application is a continuation application of U.S. patent application Ser. No. 17/746,669, filed May 17, 2022, which is a continuation application of U.S. patent application Ser. No. 16/952,507, filed Nov. 19, 2020, which is a continuation application of U.S. patent application Ser. No. 16/849,121, filed Apr. 15, 2020, which is a continuation application of U.S. patent application Ser. No. 16/559,423, filed Sep. 3, 2019, which application is a continuation application of U.S. patent application Ser. No. 16/226,592, filed Dec. 19, 2018, which application is a continuation application of U.S. patent application Ser. No. 15/967,280, filed Apr. 30, 2018, which application is a continuation application of U.S. patent application Ser. No. 15/639,610, filed Jun. 30, 2017, now U.S. Pat. No. 10,032,000, which application is a continuation application of U.S. patent application Ser. No. 14/871,020, filed Sep. 30, 2015, now U.S. Pat. No. 9,727,692, which application is a continuation application of U.S. patent application Ser. No. 14/474,034, filed Aug. 29, 2014, now U.S. Pat. No. 9,183,496, which application claims priority to U.S. Provisional Patent Application Serial No. 61/872,611, filed Aug. 30, 2013, each of which is entirely incorporated herein by reference.

BACKGROUND

Exome sequencing, while cost effective, may have the shortcoming of only probing 1-2% of the genome. For many clinical applications of genetic analysis, sensitivity to large copy number variations and micro-deletions can be achieved using competitive hybridization array or whole genome sequencing. However, the former has little or no sensitivity to novel or rare single nucleotide variations and/or other small variants while the latter is substantially more expensive than exome sequencing but adds little sensitivity for clinically interpretable variants.

SUMMARY

The present disclosure provides computer systems and methods that can analyze a sample of a subject to, for example, identify the presence or absence of one or more polymorphisms in the sample. Methods provided herein advantageously employ untargeted (e.g., whole genome) sequencing and target-specific sequencing to generate an output that indicates the presence or absence of one or more polymorphisms in the sample of the subject.

The present disclosure provides an assay that can generate a first set of data containing at least some or all variants (e.g., single nucleotide variations or insertion deletion polymorphisms) from the exome of a nucleic acid sample of a subject, in addition to at least another assay that can generate a second set of data that contains at least some or all variants (e.g., structural variants or copy number variations) from the whole genome of the nucleic acid sample of the subject. The first set of data and second set of data can be combined into a combined output. In some cases, variants from the first set of data and variants from the second set of data are combined into the combined output. This can provide a rapid and economical approach to identifying the presence or absence of one or more polymorphisms in the nucleic acid sample of the subject. Such assaying can be accomplished using nucleic acid sequencing or nucleic acid amplification (e.g., polymerase chain reaction (PCR)) with two different sample preparation methods that are directed to identifying variants in the exome and variants in the genome. For example, variants in the exome can be identified using target-specific sequencing (e.g., using target specific primers), and variants in the whole genome can be identified using untargeted sequencing.

An aspect of the present disclosure provides a computer-implemented method, comprising: (a) receiving a first data input, wherein the first data input comprises untargeted sequencing data generated from a first nucleic acid sample of a subject; (b) receiving a second data input, wherein the second data input comprises target-specific sequencing data generated from a second nucleic acid sample of the subject; (c) combining, with the aid of a computer processor, the first data input and the second data input to produce a combined data set; and (d) generating, with the aid of a computer processor, an output derived from the combined data set, wherein the output is indicative of the presence or absence of one or more polymorphisms of the first nucleic acid sample and/or the second nucleic acid sample.

In some embodiments, the first nucleic acid sample and the second nucleic acid sample are the same sample. Alternatively, the first nucleic acid sample and the second nucleic acid sample are different samples.

In an embodiment, combining the first data input and the second data input comprises removing redundant sequences. In another embodiment, the combined data set contains no redundant sequence information. In an embodiment, the output comprises a first alignment, wherein the first alignment is generated by mapping the first data input onto a first reference sequence. In another embodiment, the output further comprises a second alignment, wherein the second alignment is generated by mapping the second data input onto a second reference sequence. In another embodiment, the output comprises a uniform alignment, wherein the uniform alignment is generated by combining the first alignment with the second alignment. In another embodiment, combining the first alignment with the second alignment comprises removing redundant sequences. In another embodiment, the output comprises a uniform alignment, wherein the uniform alignment is generated by mapping the combined data set onto a reference sequence. In another embodiment, the uniform alignment contains no redundant sequence information.
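By way of illustration only, the combining step with removal of redundant sequences could be sketched as follows. The disclosure does not define "redundant"; this sketch assumes it means reads with identical sequence strings, and the names `Read` and `combine_inputs` are hypothetical.

```python
from typing import Iterable, NamedTuple


class Read(NamedTuple):
    read_id: str   # hypothetical identifier for the read
    sequence: str  # base calls for the read


def combine_inputs(first: Iterable[Read], second: Iterable[Read]) -> list[Read]:
    """Merge two read collections, keeping the first occurrence of each sequence.

    Assumption: a read is "redundant" when its sequence string has already
    been seen in either input.
    """
    seen: set[str] = set()
    combined: list[Read] = []
    for read in (*first, *second):
        if read.sequence not in seen:
            seen.add(read.sequence)
            combined.append(read)
    return combined


# Toy inputs: one untargeted read and one targeted read share a sequence.
untargeted = [Read("wgs1", "ACGTACGT"), Read("wgs2", "TTGACCAA")]
targeted = [Read("exome1", "ACGTACGT"), Read("exome2", "GGCATTAC")]
combined = combine_inputs(untargeted, targeted)
```

In this toy example the combined data set keeps `wgs1`, `wgs2`, and `exome2`; the duplicate `exome1` is dropped. A production pipeline would more likely identify duplicates from alignment coordinates than from raw sequence identity.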

In an embodiment, some or all of the reference sequences can include sequences that are from a human subject.

In an embodiment, the target-specific sequencing data is based on targeted sequencing of exomes, specific genes, genomic regions, or a combination thereof. In another embodiment, the target-specific sequencing data is generated from the use of one or more non-random primers. In another embodiment, the non-random primers are chemically synthesized. In another embodiment, the non-random primers comprise primers targeting one or more genes, exons, untranslated regions, or a combination thereof. In another embodiment, a target specific primer can be a non-random primer. For example, a primer targeting one or more exons can be a non-random primer. In another embodiment, the non-random primers have sequences that are designed to be complementary to known genomic regions. In another embodiment, the genomic regions comprise one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions, or a combination thereof. In another embodiment, the one or more polymorphisms comprise one or more single nucleotide variations, copy number variations, insertions, deletions, structural variant junctions, variable length tandem repeats, or a combination thereof. In another embodiment, the untargeted sequencing data is generated from the use of one or more random primers. In another embodiment, the random primers are chemically synthesized. In another embodiment, the random primers have sequences that are not designed to be complementary to known genomic regions. In another embodiment, the random primers comprise random hexamer primers. In another embodiment, the hexamer primers comprise oligonucleotide sequences of 6 bases which are synthesized entirely randomly. In another embodiment, the untargeted sequencing data comprises between about 0.5 to about 5 gigabases. In another embodiment, the untargeted sequencing data comprises at least about 10 megabases.
In another embodiment, the untargeted sequencing data comprises at least about 300 megabases. In another embodiment, the untargeted sequencing data comprises less than about 100 gigabases. In another embodiment, the untargeted sequencing data comprises less than about 5 gigabases. In another embodiment, the analysis of the first data input comprises assigning data into genomic bins of between about 100 to about 1,000,000 basepairs. In another embodiment, the analysis of the first data input comprises assigning data into genomic bins of at least about 1 kilobasepairs. In another embodiment, the analysis of the first data input comprises assigning data into genomic bins of less than about 100 megabasepairs. In another embodiment, the analysis of the first data input comprises assigning data into genomic bins of 50 kilobasepairs. In another embodiment, the analysis of the first data input comprises assigning data into genomic bins of 1 megabasepairs. In another embodiment, the untargeted sequencing data is whole genome sequencing data, off-target data arising from a targeted assay, or a combination thereof. In another embodiment, the whole genome sequencing data comprises single reads. In another embodiment, the whole genome sequencing data comprises paired-end reads. In another embodiment, the whole genome sequencing data comprises mate-pair reads. In another embodiment, the paired-end reads have insert-sizes of larger than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, or 100.0 kilobasepairs.
In another embodiment, the mate-pair reads have insert-sizes of larger than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, or 100.0 kilobasepairs. In another embodiment, the untargeted sequencing data is obtained at a single nucleotide variations detection sensitivity that is less than or equal to about 80%. In another embodiment, the untargeted sequencing data is obtained at a single nucleotide variations detection sensitivity that is at least about 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, or 99%. In another embodiment, the untargeted sequencing data is obtained at a single nucleotide variations detection sensitivity that is from about 5% to 50%.
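As a non-limiting sketch, assigning untargeted read data into fixed-width genomic bins (here 50 kilobasepairs, one of the sizes recited above) could look like the following. The function name `bin_read_counts`, the 0-based start coordinates, and the rule that a read is assigned to the bin containing its start position are assumptions made for illustration.

```python
from collections import Counter


def bin_read_counts(positions, bin_size=50_000):
    """Count reads per (chromosome, bin index).

    `positions` is an iterable of (chromosome, 0-based start) pairs; a read
    is assigned to the bin that contains its start coordinate (an assumption).
    """
    counts = Counter()
    for chrom, start in positions:
        counts[(chrom, start // bin_size)] += 1
    return dict(counts)


# Toy alignments: two reads fall in the first chr1 bin, one in the second.
reads = [("chr1", 10), ("chr1", 49_999), ("chr1", 50_000), ("chr2", 120_000)]
counts = bin_read_counts(reads)
```

Larger bins average over more reads and reduce counting noise, at the cost of resolution for small events; the per-bin counts can then serve as the read-depth signal for copy number analysis.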

In another embodiment, the first data input and the second data input are combined to produce a combined data set. In another embodiment, an output is generated from the combined data set. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of between about 100 to about 1,000,000 basepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of at least about 100, 200, 300, 400, 500, 600, 700, 800, or 900 basepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, or 900 kilobasepairs. For example, the generation of the output comprises assigning data from the combined data set into genomic bins of at least about 1 kilobasepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000 megabasepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of less than about 100, 200, 300, 400, 500, 600, 700, 800, or 900 basepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, or 900 kilobasepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000 megabasepairs.
For example, the generation of the output comprises assigning data from the combined data set into genomic bins of less than about 100 megabasepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000 kilobasepairs. For example, the generation of the output comprises assigning data from the combined data set into genomic bins of 50 kilobasepairs. In another embodiment, the generation of the output comprises assigning data from the combined data set into genomic bins of less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900 or 1000 megabasepairs. For example, the generation of the output comprises assigning data from the combined data set into genomic bins of 1 megabasepairs. In another embodiment, the output is based on data from one or more databases or data sources. In another embodiment, the first and/or second reference sequence comprises data from one or more databases or data sources. In another embodiment, the first reference sequence and the second reference sequence are the same sequence. In another embodiment, the first reference sequence and the second reference sequence are different sequences. In another embodiment, the one or more databases or data sources are selected from the group consisting of medical records, clinical notes, genomic databases, biomedical databases, clinical databases, scientific databases, disease databases, variant databases and biomarker databases. In another embodiment, the one or more databases or data sources comprise proprietary databases. In another embodiment, the proprietary databases are selected from the group consisting of a disease database, variant database and pharmacogenomics database.
In another embodiment, the one or more databases or data sources comprise publicly-available databases. In another embodiment, the publicly-available databases are selected from the group consisting of Orphanet, Human Phenotype Ontology, Online Mendelian Inheritance in Man, Model Organism Gene Knock-Out databases, KEGG Disease Database, and dbSNP. In another embodiment, the one or more polymorphisms are selected from the group consisting of copy number variations, single nucleotide variations, structural variants, micro-deletions, polymorphisms, insertions and deletions. In another embodiment, the output is generated with the aid of one or more machine-executed algorithms. In another embodiment, the output is generated with the aid of one or more statistical models. In another embodiment, the one or more statistical models comprise a Markov model. In another embodiment, the Markov model is a Hidden Markov Model. In another embodiment, the Hidden Markov Model is given an internal state, wherein the internal state is set according to an overall copy number of a chromosome in the first or second nucleic acid sample. In another embodiment, the Hidden Markov Model is used to filter the output by examination of measured insert-sizes of reads near a detected feature's breakpoint(s). In another embodiment, the output further comprises detection of one or more haplotypes. In another embodiment, the target-specific sequencing data is obtained at a single nucleotide variations sensitivity that is greater than or equal to about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, or 99%. In another embodiment, the output of the combined data set is displayed on a graphical user interface of an electronic display coupled to the computer. In another embodiment, the output is displayed in numeric and/or graphical form. In another embodiment, the output is an electronic report. In another embodiment, the first nucleic acid sample and the second nucleic acid sample are the same sample.
In another embodiment, the output has coverage of at least about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, or 99% of the whole genome of the subject. In another embodiment, the output has coverage of the whole genome of the subject.
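The use of a Hidden Markov Model on binned read depth can be illustrated with a minimal Viterbi decoder. This is a sketch under stated assumptions and not the model of the disclosure: the hidden states are per-bin copy numbers 1, 2, and 3, emissions are Poisson with mean proportional to copy number, and the transition probability `stay_prob` and the parameter `mean_per_copy` are hypothetical values chosen for the example.

```python
import math


def viterbi_copy_number(counts, mean_per_copy, states=(1, 2, 3), stay_prob=0.99):
    """Return the most likely copy-number state for each bin.

    Assumptions: Poisson emissions with mean = state * mean_per_copy, a
    uniform prior over the starting state, and a symmetric transition
    matrix that stays in the current state with probability `stay_prob`.
    """
    def log_poisson(k, lam):
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    n = len(states)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    # Initialise with the uniform prior and the first bin's emission.
    score = [math.log(1.0 / n) + log_poisson(counts[0], s * mean_per_copy)
             for s in states]
    backpointers = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(states):
            best_i = max(range(n), key=lambda i: score[i] + log_trans(i, j))
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans(best_i, j)
                             + log_poisson(k, s * mean_per_copy))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the final bin.
    idx = max(range(n), key=lambda i: score[i])
    path = [idx]
    for pointers in reversed(backpointers):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [states[i] for i in path]


# Toy read depth: about 100 reads per bin at copy number 2, with a
# three-bin dip to about 50 reads (consistent with a heterozygous deletion).
depth = [100, 104, 97, 52, 48, 50, 101, 99]
calls = viterbi_copy_number(depth, mean_per_copy=50)
```

For this toy input the decoded path is copy number 2 for the flanking bins and 1 for the three low-depth bins. Real read-depth data is typically overdispersed relative to a Poisson model and affected by GC content and mappability, so a practical implementation would need a different emission model or prior normalization.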

In some embodiments, the target-specific sequencing data comprises a specific portion and a non-specific portion, and wherein at least a portion of the untargeted sequencing data is the non-specific portion of the target-specific sequencing data. In an embodiment, the untargeted sequencing data is the non-specific portion of the target-specific sequencing data. In another embodiment, the target-specific sequencing data is whole exome sequencing data. In another embodiment, the specific portion is the exonic portion of the whole exome sequencing data and the non-specific portion is the non-exonic portion of the whole exome sequencing data.

In some embodiments, each of the first data input and the second data input comprises variant data, and the combining comprises combining the variant data from the first data input and the second data input into the combined data set. In an embodiment, the first data input comprises copy number and/or structural variant data, and the second data input comprises single nucleotide variations (SNV) and/or insertion deletion polymorphism (indel) data. In another embodiment, the method further comprises performing untargeted sequencing on the first nucleic acid sample to generate the first data input. In another embodiment, the method further comprises performing target-specific sequencing on the second nucleic acid sample to generate the second data input.
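Combining variant-level data from the two inputs can be sketched as a merge of two call lists into one coordinate-sorted set. The `Variant` record, its field names, and the half-open coordinates are hypothetical; the disclosure does not prescribe a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    start: int   # 0-based start (assumed convention)
    end: int     # exclusive end
    kind: str    # e.g. "SNV", "indel", "CNV_loss"
    source: str  # which data input produced the call


def combine_variants(first, second):
    """Merge two variant lists, dropping exact duplicate calls.

    A call is treated as a duplicate when chromosome, coordinates, and kind
    all match; the first occurrence is kept (an assumption).
    """
    merged = {}
    for v in (*first, *second):
        merged.setdefault((v.chrom, v.start, v.end, v.kind), v)
    return sorted(merged.values(), key=lambda v: (v.chrom, v.start, v.end))


# Copy number call from untargeted data; small variants from targeted data.
untargeted_calls = [Variant("chr1", 1_000_000, 1_250_000, "CNV_loss", "untargeted")]
targeted_calls = [
    Variant("chr1", 5_000, 5_001, "SNV", "targeted"),
    Variant("chr1", 1_100_000, 1_100_002, "indel", "targeted"),
]
combined_calls = combine_variants(untargeted_calls, targeted_calls)
```

Keeping the `source` field lets a downstream report show which assay supported each call, for instance that the indel at position 1,100,000 lies inside the copy number loss detected from the untargeted data.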

In another aspect of the present disclosure, a system for genomic sequencing comprises: (a) one or more memory locations (e.g., flash memory, hard disk) comprising a first data input and second data input, wherein the first data input comprises untargeted sequencing data and the second data input comprises target-specific sequencing data of or related to a genome of a subject or a portion thereof; (b) a computer processor operably coupled to the one or more memory locations, wherein the computer processor is programmed to combine the first data input and the second data input to produce a combined data set; and (c) an electronic display coupled to the computer processor, wherein the electronic display presents an output derived from at least a portion of the combined data set, which output is indicative of the presence or absence of one or more polymorphisms in the genome of the subject or the portion thereof. In an embodiment, the electronic display comprises a graphical user interface that is configured to display the at least the portion of the combined data set.

In another aspect of the present disclosure, a system for genomic sequencing comprises: (a) at least one memory location comprising a first data input and second data input, wherein the first data input comprises untargeted sequencing data and the second data input comprises target-specific sequencing data of or related to a genome of a subject or a portion of the genome, wherein the first data input is obtained from a first nucleic acid sample of the subject and the second data input is obtained from a second nucleic acid sample of the subject; (b) a computer processor operably coupled to the at least one memory location, wherein the computer processor is programmed to (i) combine the first data input and the second data input to produce a combined data set and (ii) generate an output from at least a portion of the combined data set, wherein the output is indicative of the presence or absence of one or more polymorphisms in the genome of the subject or a portion of the genome; and (c) an electronic display coupled to the computer processor, wherein the electronic display provides the output for display to a user. In an embodiment, the first data input does not contain target-specific sequencing data. In another embodiment, the second data input contains less than 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, or 50% non-targeted sequence. In another aspect, the computer processor of the system herein is programmed to map the first data input onto a first reference sequence in memory to generate a first alignment. In one embodiment, the computer processor is programmed to map the second data input onto a second reference sequence in memory to generate a second alignment. In another embodiment, the computer processor is programmed to combine the first alignment and the second alignment to generate a uniform alignment in the output. In another embodiment, the computer processor is programmed to remove redundant sequences upon combining the first alignment and the second alignment.
In another embodiment, the computer processor is programmed to remove redundant sequences upon combining the first data input and the second data input. In another embodiment, the output comprises a uniform alignment that is generated by mapping the combined data set onto one or more reference sequences. In another embodiment, the uniform alignment contains no redundant sequence information. In another embodiment, the output is generated from all of the combined data set. In another embodiment, the first nucleic acid sample and the second nucleic acid sample are the same sample. In another embodiment, the electronic display is part of a remote computer system of the user. In another embodiment, each of the first data input and the second data input comprises variant data, and the computer processor combines the variant data from the first data input and the second data input into the combined data set. In another embodiment, the first data input comprises copy number and/or structural variant data, and the second data input comprises single nucleotide variations (SNV) and/or insertion deletion polymorphism (indel) data.

In another aspect of the present disclosure, a system for genomic sequencing comprises: (a) at least one memory location comprising a first data input and second data input, wherein the first data input comprises untargeted sequencing data and the second data input comprises target-specific sequencing data of or related to a genome of a subject or a portion of the genome, wherein the first data input is obtained from a first nucleic acid sample of the subject and the second data input is obtained from a second nucleic acid sample of the subject; (b) a computer processor coupled to the at least one memory location, wherein the computer processor is programmed to (i) combine the first data input and the second data input to produce a combined data set, and (ii) generate an output from at least a portion of the combined data set, wherein the output is indicative of the presence or absence of one or more polymorphisms in the genome of the subject or a portion of the genome; and (c) an electronic data storage unit coupled to the computer processor, wherein the electronic data storage unit comprises the combined data set and/or the output. In another aspect, the electronic data storage unit described herein comprises the combined data set and the output. In one embodiment, the first data input does not contain target-specific sequencing data. In another embodiment, the second data input contains less than 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, or 50% non-targeted sequence. In another embodiment, the computer processor is programmed to map the first data input onto a first reference sequence in memory to generate a first alignment. In another embodiment, the computer processor is programmed to map the second data input onto a second reference sequence in memory to generate a second alignment. In another embodiment, the computer processor is programmed to combine the first alignment and the second alignment to generate a uniform alignment in the output.
In another embodiment, the computer processor is programmed to remove redundant sequences upon combining the first alignment and the second alignment. In another embodiment, the computer processor is programmed to remove redundant sequences upon combining the first data input and the second data input. In another embodiment, the output comprises a uniform alignment that is generated by mapping the combined data set onto one or more reference sequences. In another embodiment, the uniform alignment contains no redundant sequence information. In another embodiment, the first nucleic acid sample and the second nucleic acid sample are the same sample. In another embodiment, the output is generated from all of the combined data set. In another embodiment, each of the first data input and the second data input comprises variant data, and the computer processor combines the variant data from the first data input and the second data input into the combined data set. In another embodiment, the first data input comprises copy number and/or structural variant data, and the second data input comprises single nucleotide variations (SNV) and/or insertion deletion polymorphism (indel) data.

In another aspect of the present disclosure, a computer-implemented method comprises: (a) receiving a data input comprising target-specific sequencing data generated from a nucleic acid sample of a subject; (b) annotating, with the aid of a computer processor, the data as targeted or untargeted; and (c) generating, using a computer processor, an output that includes an analysis of the untargeted data, wherein the output is indicative of the absence or presence of one or more polymorphisms in the nucleic acid sample. In an embodiment, the output further comprises analysis of the targeted data. In another embodiment, the target-specific sequencing data is directed to exons, selected genes, genomic regions, variants, or a combination thereof. In another embodiment, the genomic regions comprise one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions, or a combination thereof. In another embodiment, the one or more polymorphisms comprise one or more insertions, deletions, structural variant junctions, variable length tandem repeats, single nucleotide variants, copy number variants, or a combination thereof. In another embodiment, the one or more polymorphisms are selected from copy number variations, single nucleotide variations, structural variants, micro-deletions, polymorphisms or a combination thereof. In another embodiment, the method further comprises receiving one or more additional data inputs comprising sequencing data generated from one or more additional nucleic acid samples. In another embodiment, the one or more additional data inputs are generated from untargeted sequencing, target-specific sequencing, or a combination thereof. In another embodiment, the one or more additional nucleic acid samples are of the subject.
In another embodiment, the method further comprises generating one or more biomedical reports at least in part based on the output. In another embodiment, the method further comprises determining, administering, or modifying a therapeutic regimen for a subject based at least in part on the output. In another embodiment, the method further comprises diagnosing, predicting, or monitoring a disease or a condition in a subject based at least in part on the output.
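The annotating step of this aspect can be illustrated by labeling each aligned read according to whether it overlaps a target interval. This is a sketch only: the function name `annotate_reads`, the half-open coordinate convention, and the rule that any overlap counts as "targeted" are assumptions, not requirements of the disclosure.

```python
def annotate_reads(reads, targets):
    """Label each aligned read as 'targeted' or 'untargeted'.

    `reads` and `targets` are lists of (chromosome, start, end) tuples with
    half-open coordinates. A read is 'targeted' if it overlaps any target
    interval by at least one base (an assumed rule).
    """
    labels = []
    for chrom, start, end in reads:
        hit = any(
            t_chrom == chrom and start < t_end and t_start < end
            for t_chrom, t_start, t_end in targets
        )
        labels.append("targeted" if hit else "untargeted")
    return labels


# One exon target; only the first read overlaps it.
targets = [("chr1", 1_000, 1_200)]
reads = [("chr1", 950, 1_050), ("chr1", 1_200, 1_300), ("chr2", 1_000, 1_100)]
labels = annotate_reads(reads, targets)
```

The reads labeled `untargeted` are the off-target portion that can then be analyzed as low-coverage untargeted data. The linear scan over targets is adequate for a toy example; an exome-scale target list would call for an interval tree or a sorted sweep.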

In another aspect of the present disclosure, a system for analyzing nucleic acid sequencing data comprises: (a) one or more memory locations comprising a data input containing target-specific sequencing data generated from a nucleic acid sample of a subject; and (b) a computer processor coupled to the one or more memory locations and programmed to (i) annotate the data input as targeted or untargeted sequencing data, (ii) perform an analysis on untargeted sequencing data in the data input to identify the presence or absence of one or more polymorphisms, and (iii) generate an output based on the analysis of the untargeted sequencing data, wherein the output is indicative of the presence or absence of one or more polymorphisms. In an embodiment, the computer processor is programmed to analyze the target-specific sequencing data. In another embodiment, the target-specific sequencing data is whole exome sequencing data. In another embodiment, the target-specific sequencing data is from the use of one or more nonrandom primers. In another embodiment, at least about 10% of the data input is annotated as untargeted sequencing data. In another embodiment, the untargeted sequencing data comprises non-exonic sequencing data. In another embodiment, less than about 90% of the data input is annotated as target-specific sequencing data. In another embodiment, the target-specific sequencing data comprises sequencing data pertaining to exomes, genes, genomic regions, or a combination thereof. In another embodiment, the data input further comprises untargeted sequencing data. In another embodiment, the untargeted sequencing data is generated from the nucleic acid sample. In another embodiment, the untargeted sequencing data is whole genome sequencing data.

Another aspect of the present disclosure provides a computer readable medium that comprises machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

Another aspect of the present disclosure provides a computer system comprising one or more computer processors and a memory location coupled thereto. The memory location comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “FIG.” and “FIGS.” herein), of which:

FIG. 1 shows a computer system for implementing the methods of the present disclosure;

FIGS. 2A-2C depict a schematic of four workflows of the present disclosure. “Prep 1” and “Prep 2” refer to subsets of nucleic acid molecules; “Assay 1” and “Assay 2” refer to assays. FIGS. 2A-2C depict assay and analysis workflows using elements of a more complex workflow;

FIG. 3 illustrates a statistical model on sensitivity vs. coverage of genomic regions with single run whole genome sequencing data;

FIG. 4 illustrates a successful detection of a heterozygous deletion in single run whole genome sequencing data using the Hidden Markov Model;

FIG. 5 illustrates the effects of different genomic bin sizes in whole genome sequencing; and

FIG. 6 illustrates other factors that can lead to systematic read-depth variation, besides the presence of copy number variations.

DETAILED DESCRIPTION

While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

The term “subject,” as used herein, generally refers to an individual having at least one biological sample that is undergoing analysis. The subject can be undergoing analysis to diagnose, predict or monitor the health, a health condition, or the well-being of the subject, such as, for example, to identify or monitor a disease condition (e.g., cancer) in the subject. The subject can have a sample that is undergoing analysis by a researcher or a service provider, such as a healthcare professional or other individual or entity that employs methods and systems of the present disclosure to analyze the sample.

The term “nucleic acid” as used herein generally refers to a polymeric form of nucleotides of any length. Nucleic acids can include ribonucleotides, deoxyribonucleotides or peptide nucleic acids (PNAs), that comprise purine and pyrimidine bases, or other natural, chemically or biochemically modified, non-natural, or derivatized nucleotide bases. A nucleic acid can be single or double stranded. The backbone of the polynucleotide can comprise sugars and phosphate groups, as may typically be found in RNA or DNA, or modified or substituted sugar or phosphate groups. A polynucleotide may comprise modified nucleotides, such as methylated nucleotides and nucleotide analogs. The sequence of nucleotides may be interrupted by non-nucleotide components. Thus the terms nucleoside, nucleotide, deoxynucleoside and deoxynucleotide generally include analogs such as those described herein. These analogs are those molecules having some structural features in common with a naturally occurring nucleoside or nucleotide such that when incorporated into a nucleic acid or oligonucleotide sequence, they allow hybridization with a naturally occurring nucleic acid sequence in solution. Typically, these analogs are derived from naturally occurring nucleosides and nucleotides by replacing and/or modifying the base, the ribose, or the phosphodiester moiety. The changes can be tailor made to stabilize or destabilize hybrid formation or enhance the specificity of hybridization with a complementary nucleic acid sequence as desired. The nucleic acid molecule may be a DNA molecule. The nucleic acid molecule may be an RNA molecule. The nucleic acid molecule may be a synthetic molecule.

The terms “variant or derivative of a nucleic acid molecule” and “derivative or variant of a nucleic acid molecule,” as used herein, generally refer to a nucleic acid molecule comprising a polymorphism. The terms “variant or derivative of a nucleic acid molecule” or “derivative or variant of a nucleic acid molecule” may also refer to a nucleic acid product that is produced from one or more assays conducted on the nucleic acid molecule. For example, a fragmented nucleic acid molecule, hybridized nucleic acid molecule (e.g., capture probe hybridized nucleic acid molecule, bead bound nucleic acid molecule), amplified nucleic acid molecule, isolated nucleic acid molecule, eluted nucleic acid molecule, and enriched nucleic acid molecule are variants or derivatives of the nucleic acid molecule.

The terms “detectable label” or “label,” as used herein, generally refer to any chemical moiety attached to a nucleotide, nucleotide polymer, or nucleic acid binding factor, wherein the attachment may be covalent or non-covalent. The label can be detectable and render the nucleotide or nucleotide polymer detectable to a user or a system operated by the user. The terms “detectable label” or “label” may be used interchangeably. Detectable labels that may be used in combination with the methods disclosed herein include, for example, a fluorescent label, a chemiluminescent label, a quencher, a radioactive label, biotin, quantum dot, gold, or a combination thereof. Detectable labels include luminescent molecules, fluorochromes, fluorescent quenching agents, colored molecules, radioisotopes or scintillants. Detectable labels also include any useful linker molecule (such as biotin, avidin, streptavidin, HRP, protein A, protein G, antibodies or fragments thereof, Grb2, polyhistidine, Ni²⁺, FLAG tags, myc tags), heavy metals, enzymes (examples include alkaline phosphatase, peroxidase and luciferase), electron donors/acceptors, acridinium esters, dyes and colorimetric substrates. It is also envisioned that a change in mass may be considered a detectable label, as is the case of surface plasmon resonance detection.

The terms “bound”, “hybridized”, “conjugated”, “attached”, and “linked” can be used interchangeably and generally refer to the association of an object to another object. The association of the two objects to each other may be from a covalent or non-covalent interaction. For example, a capture probe hybridized nucleic acid molecule refers to a capture probe associated with a nucleic acid molecule. The capture probe and the nucleic acid molecule are in contact with each other. In another example, a bead bound nucleic acid molecule refers to a bead associated with a nucleic acid molecule.

The terms “target-specific”, “targeted,” and “specific” can be used interchangeably and generally refer to a subset of the genome that is a region of interest, or a subset of the genome that comprises specific genes or genomic regions. For example, the specific genomic regions can be a region that is guanine and cytosine (GC) rich. Targeted sequencing methods can allow one to selectively capture genomic regions of interest from a nucleic acid sample prior to sequencing. Targeted sequencing involves alternate methods of sample preparation that produce libraries that represent a desired subset of the genome or to enrich the desired subset of the genome. The terms “untargeted sequencing” or “non-targeted sequencing” can be used interchangeably and generally refer to a sequencing method that does not target or enrich a region of interest in a nucleic acid sample. The terms “untargeted sequence”, “non-targeted sequence,” or “non-specific sequence” generally refer to the nucleic acid sequences that are not in a region of interest or to sequence data that is generated by a sequencing method that does not target or enrich a region of interest in a nucleic acid sample. The terms “untargeted sequence”, “non-targeted sequence” or “non-specific sequence” can also refer to sequence that is outside of a region of interest. In some cases, sequencing data that is generated by a targeted sequencing method can comprise not only targeted sequences but also untargeted sequences.

Where a range of values is provided, it is understood that each intervening value between the upper and lower limits of that range, to the tenth of the unit of the lower limit, unless the context clearly dictates otherwise, is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range, and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present invention. Any recited method can be carried out in the order of operations recited or disclosed, or in any other logical order.

Sample Processing and Data Analysis

This disclosure provides computer-implemented methods and systems for sample processing and data analysis such as genome sequencing or other types of sequencing. The computer-implemented method, genome sequencing, or sequencing methods can comprise receiving a first data input comprising untargeted sequencing data (i.e., whole genome sequencing data), and a second data input comprising target-specific sequencing data, followed by combining or analyzing the first and second data inputs and generating an output derived from the combined data or analysis. The untargeted sequencing data (i.e., whole genome sequencing data) can comprise single reads or, for example, between about 1 to about 5 gigabases or less. The untargeted sequencing data (i.e., whole genome sequencing data) can comprise paired-end reads. The untargeted sequencing data (i.e., whole genome sequencing data) can comprise mate-pair reads. The paired-end reads and/or the mate-pair reads may have insert-sizes of larger than about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, or 100.0 kilobasepairs. When paired-end reads and/or mate-pair reads are used, an insert-size can be the size of DNA inserted between the adaptors which enable amplification and sequencing of the DNA. The target-specific sequencing may be exome sequencing. The output can provide sequencing information that comprises sets of genes or regulatory elements, or one or more polymorphisms of genomic regions.

In some cases, sample processing includes nucleic acid sample processing and subsequent nucleic acid sample sequencing such as genome sequencing or other types of sequencing. Some or all of a nucleic acid sample may be sequenced to provide sequence information including untargeted sequencing or target-specific sequencing, which may be received, stored, analyzed or otherwise maintained in an electronic, magnetic, or optical storage location or by a computer. The sequence information may be analyzed or implemented with the aid of a computer processor, and the analyzed sequence information may be stored in an electronic storage location. The electronic storage location may include a pool or collection of sequence information and analyzed sequence information generated from the nucleic acid sample.

In some cases, the subjects of the present disclosure may be mammals or non-mammals. Preferably the subject is a mammal, such as a human, non-human primate (e.g., apes, monkeys, chimpanzees), cat, dog, rabbit, goat, horse, cow, pig, or sheep. Even more preferably, the subject is a human. For example, sequences of the present disclosure can include sequences that are from a human subject. In some cases, reference sequences of the present disclosure can include sequences that are from a human subject. The subject may be male or female; the subject may be a fetus, infant, child, adolescent, teenager or adult. Non-mammals include, but are not limited to, reptiles, amphibians, avians, and fish. A reptile may be a lizard, snake, alligator, turtle, crocodile, or tortoise. An amphibian may be a toad, frog, newt, or salamander. Examples of avians include, but are not limited to, ducks, geese, penguins, ostriches, and owls. Examples of fish include, but are not limited to, catfish, eels, sharks, and swordfish.

In some examples, a user, such as a healthcare provider, may request a first set of sequence information, a first data input, or analyzed sequence information from the pool. Concurrently or subsequently, the user may request a second set of sequence information, a second data input or analyzed sequence information from the pool. The first set may be different from the second set. The first set may be combined with the second set with the aid of a computer processor to generate an output comprising the combined data. The sequencing information can comprise sets of genes or regulatory elements, or one or more polymorphisms. The one or more polymorphisms can be of genomic regions. The whole genome sequencing data can comprise single reads or between about 1 to about 5 gigabases or less.

The target-specific sequencing can include data from a targeted sequencing assay, e.g., exome sequencing. The target-specific sequencing data may comprise a targeted portion and an untargeted portion of the sequencing data, e.g., exonic data and non-exonic data. The untargeted portion of the target-specific sequencing data, such as the non-exonic data, may provide sequence information of genomic regions spanning part or all of the whole genome regions outside of the targeted region or exons. The computer-implemented methods or methods of genome sequencing and/or sequencing may generate an output that is indicative of or comprising sequence information of the genomic regions comprising one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions or a combination thereof based on the exonic and non-exonic data from exome sequencing data. The GC-content can be the percentage of nitrogenous bases on a DNA molecule that are either guanine or cytosine. In some cases, the methods further comprise whole genome sequencing of single reads or less, or for example, less than about 5 gigabases. In some cases, the methods further comprise whole genome sequencing of paired-end reads. In some cases, the methods further comprise whole genome sequencing of mate-pair reads.
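The GC-content definition above can be illustrated with a short sketch. The function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for this illustration and are not taken from the disclosure.

```python
def gc_content(sequence: str) -> float:
    """Return the percentage of bases in `sequence` that are G or C.

    Illustrative assumption: only unambiguous A, C, G, T bases are counted,
    so unknown calls (e.g., N) do not dilute the percentage.
    """
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)
```

A region could then be labeled as high or low GC content by comparing this percentage against thresholds chosen by the user.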

Disclosed herein are computer-implemented methods or methods for analyzing a nucleic acid sample. The methods of the disclosure can comprise (a) receiving a first data input, the first data input comprising untargeted sequencing data generated from a first nucleic acid sample; (b) receiving a second data input, the second data input comprising target-specific sequencing data generated from a second nucleic acid sample; (c) combining and/or analyzing, with the aid of a computer processor, the first data input and the second data input to produce combined data and/or analysis; and (d) generating, with the aid of a computer processor, an output derived from the combined data and/or analysis. In some cases, the first nucleic acid sample and the second nucleic acid sample are obtained from the same source or the same subject. The output can be indicative of, or comprise a detection of, the presence or absence of one or more polymorphisms of the first nucleic acid sample, the second nucleic acid sample, and/or the genome of the subject.
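Steps (a) through (d) can be sketched in a few lines, under strong simplifying assumptions. Here each data input is assumed to have already been reduced to a collection of variant calls; the `Variant` record, `combine_inputs`, and `report_presence` are hypothetical names introduced only for illustration, and the disclosure does not prescribe this representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal record for one polymorphism call."""
    chrom: str
    pos: int
    ref: str
    alt: str


def combine_inputs(untargeted, targeted):
    """Step (c): merge two call sets, recording which input supports each call."""
    combined = {}
    for source, calls in (("untargeted", untargeted), ("targeted", targeted)):
        for variant in calls:
            combined.setdefault(variant, set()).add(source)
    return combined


def report_presence(combined, queries):
    """Step (d): report presence (True) or absence (False) of each queried call."""
    return {query: query in combined for query in queries}
```

For example, a call seen only in the untargeted input is still reported as present, and the set stored in `combined` shows which input(s) supported it.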

In an aspect of the present disclosure, systems for genomic sequencing or the computer-implemented methods are provided. The system can comprise (a) one or more memory locations comprising a first data input and second data input, wherein the first data input comprises untargeted sequencing data and the second data input comprises target-specific sequencing data of or related to a genome of a subject or a portion thereof; and (b) a computer processor operably coupled to the one or more memory locations, wherein the computer processor is programmed to combine and/or analyze the first data input and the second data input to produce a combined data set and/or analysis, and to generate an output from the combined data set and/or analysis, which output is indicative of the presence or absence of one or more polymorphisms in the genome of the subject or the portion thereof. In some cases, the systems can further comprise (c) an electronic display, coupled to the computer processor, wherein the electronic display presents an output derived from at least a portion of the combined data set and/or analysis in numeric and/or graphical form. The output can comprise detection of, or be indicative of, the presence or absence of one or more polymorphisms.

Also provided herein are computer-implemented methods comprising (a) receiving, by a computer, a data input, the data input comprising target-specific sequencing data from a targeted sequencing method, generated from a nucleic acid sample; (b) annotating, by the computer or with the aid of a computer processor, the data as targeted or untargeted (e.g., exonic or non-exonic); and (c) generating, by the computer, an output comprising analysis of the untargeted or non-exonic data. The output can comprise detection of, or be indicative of, one or more polymorphisms in the nucleic acid sample.
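The annotation in step (b) can be sketched as an interval lookup. This sketch assumes each read is represented by a chromosome name and a single 0-based position, and that the targeted regions are supplied as sorted, non-overlapping, half-open intervals per chromosome; `build_index` and `annotate` are hypothetical helper names, not part of the disclosure.

```python
from bisect import bisect_right


def build_index(exons):
    """Build per-chromosome lookup lists from {chrom: [(start, end), ...]}.

    Assumes half-open [start, end) intervals that do not overlap.
    """
    index = {}
    for chrom, intervals in exons.items():
        ordered = sorted(intervals)
        index[chrom] = ([s for s, _ in ordered], [e for _, e in ordered])
    return index


def annotate(chrom, pos, index):
    """Label a read position as 'exonic' (targeted) or 'non-exonic' (untargeted)."""
    starts, ends = index.get(chrom, ([], []))
    # Find the last interval whose start is <= pos, then check its end.
    i = bisect_right(starts, pos) - 1
    if i >= 0 and pos < ends[i]:
        return "exonic"
    return "non-exonic"
```

Reads labeled "non-exonic" would form the untargeted portion that step (c) analyzes. A production pipeline would more likely compare full read alignments against a capture-target file, which this sketch does not attempt.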

Also disclosed herein are systems for sequencing comprising (a) one or more memory locations comprising a data input, wherein the data input comprises sequencing data generated by a target-specific sequencing method; and (b) a computer processor coupled to the one or more memory locations, wherein the computer processor is programmed to annotate the data input as targeted or untargeted (e.g., exonic or non-exonic) sequencing data, analyze the untargeted sequencing data, and generate an output based on the annotation and/or the analysis of the untargeted input data. In some cases, the output comprises detection of one or more polymorphisms, or the output is indicative of one or more polymorphisms.

Types of Data

In some embodiments, the computer-implemented methods and systems for genome sequencing or sequencing may include data that is generated by untargeted sequencing or a target-specific sequencing assay or method. The target-specific sequencing can be directed to a subset of the genome that is a region of interest (e.g., to a user). Non-limiting examples include exome, a particular chromosome, a set of genes, or genomic regions. Targeted sequencing methods can allow one to selectively capture genomic regions of interest from a nucleic acid sample prior to sequencing. Targeted sequencing involves alternate methods of sample preparation that produce libraries that represent a desired subset of the genome or to enrich the desired subset of the genome. In some cases, this subset is the exome, which can be functionally important and therefore can be a strong candidate target for medical/gene-related research. By targeting the exome of an individual, it is possible to identify known genetic variants that could promote a disease phenotype. Additionally, by targeting the exomes of multiple patients, rare variants can be found, and further analysis on the functional consequences of the mutation can be completed. Exome sequencing can use either a ‘solution-based capture’ or ‘microarray capture’ method. The array-based method can be used when the target design may only be used across a small number of samples (up to 20 or so). Studies that focus on even smaller regions of the genome may also employ polymerase chain reaction (PCR) based approaches. Some methods for exome sequencing are described in Bainbridge, M. et al. (2010) Whole exome capture in solution with 3 Gbp of data. Genome Biology 11:R62, Kiialainen, A. et al. (2011) Performance of Microarray and Liquid Based Capture Methods for Target Enrichment for Massively Parallel Sequencing and SNP Discovery. PLoS ONE 6(2):e16486, or Tewhey, R. et al. (2009) Microdroplet-based PCR enrichment for large-scale targeted sequencing. Nature Biotechnology 27:1025-1031, hereby incorporated by reference in their entirety.

Untargeted sequencing can be a sequencing method that does not target or enrich a region of interest in a nucleic acid sample. The untargeted or comprehensive sequencing may be whole genome sequencing or whole transcriptome sequencing. The untargeted sequence can be the nucleic acid sequences that are not in a region of interest or sequence data that is generated by a sequencing method that does not target or enrich a region of interest in a nucleic acid sample. The untargeted sequence can be a sequence that is outside of a region of interest. In some cases, sequencing data that is generated by a targeted sequencing method can comprise not only targeted sequences but also untargeted sequences.

The types of target-specific sequencing may be whole exome sequencing, RNA sequencing, DNA sequencing, or targeted sequencing of one or more specific genes or genomic regions, or a combination thereof. The specific genes or genomic regions may be indicative of any specific pathways or specific disorders, such as a genetic disorder or single nucleotide polymorphism. The genomic regions may comprise one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions or a combination thereof. Utilizing the above sequencing assays allows for untargeted sequencing of the sample or targeted sequencing of the sample. In untargeted sequencing, such as whole genome sequencing or whole transcriptome sequencing, the entire DNA or RNA structure is examined. In targeted sequencing assays, only targeted or specific portions of the DNA or RNA are intended to be sequenced. A variety of different sequencing methods and assays can be found in U.S. Patent Publication No. 2013/0178389 A1, hereby incorporated by reference in its entirety. In some embodiments, the sequencing data may comprise sequence information of at least about 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 15; 20; 25; 30; 35; 40; 45; 50; 60; 70; 80; 90; 100; 120; 140; 160; 180; 200; 240; 280; 300; 350; 400; 450; 500; 600; 700; 800; 1000; 1500; 2000; 2500; 3000; 3500; 4000; 5000; 7500; 10,000; 15,000; 20,000; 30,000; 45,000; 50,000; 60,000; 100,000; 200,000; 400,000; 600,000; 1 million; 1.5 million; 2 million or more genomic markers or genes.

Methods and systems as described herein can comprise combining and/or analyzing a first data input and a second data input to generate an output derived from the combined data and/or analysis. In some cases, the combining and/or analyzing can further comprise combining and/or analyzing one or more different data inputs. In some cases, the methods and systems can comprise combining and/or analyzing data inputs from one or more sequencing data. The one or more sequencing data can be the same type or different. The one or more sequencing data can be partial sequencing data or complete sequencing data of any kind. The one or more sequencing data can be target-specific sequencing data. In some cases, methods and systems as described herein can comprise combining and/or analyzing untargeted sequencing data such as whole genome sequencing data with one or more target-specific sequencing data. The untargeted sequencing data can consist entirely of the non-specific or untargeted portion of the target-specific sequencing data. For example, the non-exonic portion of whole exome sequencing data can be used as the untargeted sequencing data in methods and systems as described herein. In some cases, a portion of the untargeted sequencing data can comprise the non-specific or untargeted portion of the target-specific sequencing data. The one or more target-specific sequencing data can be the same type or different. For example, without limitation, the methods and systems can comprise combining and/or analyzing whole genome sequencing data and whole exome sequencing data with specific gene sequencing data targeting a specific genetic disorder. Each of the one or more data inputs can be from the same or different nucleic acid samples.

Methods and systems described herein can comprise receiving a data input and analyzing and/or annotating the data input based on one or more parameters. The parameters can be related to the untargeted or targeted portion of the data input. For example, without limitation, methods and systems as described herein can comprise receiving a data input comprising target-specific sequencing data (e.g., whole exome sequencing data) generated from a nucleic acid sample. For example, without limitation, the methods and systems as described herein can comprise analyzing and/or annotating the data pertaining to the first nucleic acid sample as targeted or untargeted, such as exonic or non-exonic. The methods and systems as described herein may further comprise generating an output comprising a subset of the data input such as the untargeted portion of the data, or for example, without limitation, the non-exonic data. The output can further comprise the targeted portion of the data such as exonic data along with the annotated untargeted portion of the data such as non-exonic data. In some cases, the methods and systems as described herein can further comprise receiving a data input comprising target-specific sequencing data such as whole exome sequencing data and one or more additional data inputs comprising sequencing data generated from one or more nucleic acid samples. The one or more additional data inputs can be generated by any sequencing methods such as untargeted sequencing, target-specific sequencing, or a combination thereof. Similarly, the one or more additional data inputs can be annotated based on one or more parameters. Output comprising at least a portion or a subset of the one or more additional data can also be generated. The methods and systems as described herein can further comprise combining and/or analyzing one or more subsets of the one or more additional data inputs with a subset of the first data input. The methods and systems can further comprise generating an output comprising the combined data and/or analysis. The methods and systems as disclosed herein can further comprise generating one or more biomedical reports based on the output. The biomedical reports may be used for determining, administering, or modifying a therapeutic regimen for a subject.

In some cases, the combined data or analysis can be generated by a computer system, as described elsewhere herein. The combined data can be displayed electronically. At least a portion of the combined data can be presented in numeric and/or graphical form. In some cases, the methods and systems as described herein can comprise combining and/or analyzing data from at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 different or same sources. In some cases, the display or the combined data is based on data from one or more databases or data sources. The sources of data or databases can be any types of databases that are suitable for the methods and systems as described herein. In some embodiments, the databases or data sources can comprise one or more medical records, clinical notes, genomic databases, variant databases, biomedical databases, clinical databases, scientific databases, disease databases, biomarker databases, and the like. The databases or data sources can be publicly-available or proprietary. In some embodiments, the databases or data sources can comprise proprietary databases. The proprietary databases can be a database that is specific to one or more diseases. For example, without limitation, the proprietary database can be a variant database or a pharmacogenomics database. The publicly-available databases can comprise Orphanet, Human Phenotype Ontology (HPO), Online Mendelian Inheritance in Man (OMIM), Model Organism Gene Knock-Out databases, KEGG Disease Database, dbSNP, and the like.

Untargeted Sequencing

Another aspect provides computer-implemented methods and systems for genomic sequencing or other sequencing. The methods can provide sequence information regarding one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions, or a combination thereof. In some cases, the untargeted sequencing can be whole genome sequencing. In some cases, the untargeted sequencing data can be the untargeted portion of the data generated from a target-specific sequencing assay. The methods can generate an output comprising a combined data set comprising target-specific sequencing data and a low coverage untargeted sequencing data as supplement to target-specific sequencing data. Non-limiting examples of the low coverage untargeted sequencing data include low coverage whole genome sequencing data or the untargeted portion of the target-specific sequencing data. This low coverage genome data can be analyzed to assess copy number variation or other types of polymorphism of the sequence in the sample. The low coverage untargeted sequencing (i.e., single run whole genome sequencing data) can be fast and economical, and can deliver genome-wide polymorphism sensitivity in addition to the target-specific sequencing data. In addition, variants detected in the low coverage untargeted sequencing data can be used to identify known haplotype blocks and impute variants over the whole genome with or without targeted data.

Untargeted sequencing (i.e., whole genome sequencing) can determine the complete DNA sequence of the genome at one time. Untargeted sequencing (i.e., whole genome sequencing or the non-exonic portion of whole exome sequencing) can cover sequences of almost 100 percent, or about 95%, of the sample's genome. In some cases, the untargeted sequencing (i.e., whole genome sequencing or non-exonic portion of the whole exome sequencing) can cover sequences of the whole genome of the nucleic acid sample of about or at least about 99.999%, 99.5%, 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76%, 75%, 74%, 73%, 72%, 71%, 70%, 69%, 68%, 67%, 66%, 65%, 64%, 63%, 62%, 61%, 60%, 59%, 58%, 57%, 56%, 55%, 54%, 53%, 52%, 51%, or 50%.

In some cases, the output can have a coverage of about or at least about 99.999%, 99.5%, 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76%, 75%, 74%, 73%, 72%, 71%, 70%, 69%, 68%, 67%, 66%, 65%, 64%, 63%, 62%, 61%, 60%, 59%, 58%, 57%, 56%, 55%, 54%, 53%, 52%, 51%, or 50% of the whole genome of the nucleic acid sample from a subject.

In some cases, the computer-implemented methods and systems receive a first data input comprising untargeted sequencing data generated from a first nucleic acid sample. The untargeted sequencing data can be whole genome sequencing data or the untargeted portion of a target-specific sequencing data. The whole genome sequencing data can be a low coverage whole genome sequencing data. The untargeted sequencing data can comprise coverage of about or at least about 5, 4.5, 4, 3.5, 3, 2.5, 2, 1.5, or 1 gigabases; about or at least about 900, 850, 800, 750, 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 150, 100, 90, 80, 70, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 5, 4, 3, 2, or 1 megabases. The untargeted sequencing data can comprise coverage of less than about 500, 450, 400, 350, 300, 250, 200, 190, 180, 170, 160, 150, 140, 130, 120, 110, 100, 95, 90, 85, 80, 75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 gigabases. The untargeted sequencing data can comprise between about 1-5, 0.5-10, 1-10, 0.5-5, 0.1-5, 2-5, 0.1-2, or 0.1-1 gigabases.

The untargeted sequencing data (i.e., whole genome sequencing data or the untargeted portion of target-specific sequencing data) can be analyzed by assigning the data into a plurality of genomic bins and measuring the number of sequence reads in each of the plurality of genomic bins. The genomic bins can have different sizes comprising different numbers of basepairs to assess genomic regions such as the one or more polymorphisms or copy number variations (CNVs) of the sequence of the sample. The genomic bin size should be selected to balance the tradeoff, for example without limitation, between sensitivity to small CNVs and reduction of false positive detections. In some embodiments, the size of genomic bins can be between about 100 and about 1,000,000 basepairs. In some embodiments, the size of genomic bins can be about, less than about, or at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 150, 200, 250, or 300 megabasepairs; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 990 kilobasepairs; or 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 990 basepairs. In some embodiments, the size of the genomic bins can be larger than 300 megabasepairs. In some embodiments, the size of the genomic bins can be the size of a chromosome or the average size of chromosomes in a subject. In some embodiments, the size of genomic bins can be about 100 basepairs. In some embodiments, the size of genomic bins can be about 1 kilobasepair. In some embodiments, the size of genomic bins can be between about 1 kilobasepair and 20 kilobasepairs. In some embodiments, the size of genomic bins can be between about 100-0.5 million, 100-1.5 million, 300-0.5 million, 300-1 million, 300-1.5 million, 500-0.5 million, 500-1 million, or 500-1.5 million basepairs.
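The bin-and-count step described above can be illustrated with a minimal sketch. It assumes fixed-size bins and reads represented only by a chromosome name and a 0-based start position; the function name count_reads_per_bin and the parameter bin_size are hypothetical and are used for illustration only, not as the implementation of the methods described herein.

```python
from collections import Counter

def count_reads_per_bin(read_starts, bin_size):
    """Count reads falling in each fixed-size genomic bin.

    read_starts: iterable of (chromosome, 0-based start position) tuples.
    bin_size: bin width in basepairs (hypothetical parameter name).
    Returns a Counter keyed by (chromosome, bin_index).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter()
    for chrom, start in read_starts:
        # Integer division maps a position to the index of its bin.
        counts[(chrom, start // bin_size)] += 1
    return counts

# Illustrative input: five read start positions and 1 kilobasepair bins.
reads = [("chr1", 50), ("chr1", 950), ("chr1", 1000), ("chr1", 2500), ("chr2", 10)]
bins = count_reads_per_bin(reads, bin_size=1000)
```

A smaller bin_size gives finer resolution for small CNVs but fewer reads per bin, which is the sensitivity versus false-positive tradeoff noted above.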

In some cases, the untargeted sequencing (i.e., whole genome sequencing of one single read) covering about 3 gigabasepairs can deliver a genome-wide structural variation (SV) sensitivity from 50 kilobasepairs upwards. In some embodiments, the untargeted sequencing (i.e., whole genome sequencing) can result in a genome-wide structural variation sensitivity from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50, 60, 70, 80, 90, or 100 kilobasepairs upwards. In some embodiments, the untargeted sequencing (i.e., whole genome sequencing) can result in a genome-wide structural variation sensitivity from less than about 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100 kilobasepairs upwards. In addition, variants detected in the untargeted sequencing data can be used to identify known haplotype blocks and impute variants over the whole genome with or without targeted sequencing data such as whole exome sequencing data.

In some embodiments, the whole genome sequencing data can comprise single reads. Costs and time for whole genome sequencing can be lowered by eliminating the second or third reads. In some embodiments, the whole genome sequencing data can comprise less than single reads. For example, the whole genome sequencing data can comprise 0.1 to 1 reads, or 0.5 to 1 reads. In some embodiments, the whole genome sequencing data can comprise about, less than about, or at least about 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.9, 0.95, or 0.99 reads. In some embodiments, the whole genome sequencing data can further comprise second reads. In some embodiments, the whole genome sequencing data can further comprise third reads. In some embodiments, the whole genome sequencing data can further comprise 4th, 5th, 6th, 7th, 8th, 9th, 10th, 11th, 12th, 13th, 14th, 15th, 16th, 17th, 18th, 19th, 20th, 21st, 22nd, 23rd, 24th, 25th, 26th, 27th, 28th, 29th, 30th, 31st, 32nd, 33rd, 34th, 35th, 36th, 37th, 38th, 39th, 40th, 41st, 42nd, or 43rd reads. In some embodiments, the whole genome sequencing data can further comprise less than 1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 reads. Multiple reads may provide additional structural variation detection modes, e.g., anomalous read separation. However, whole genome sequencing data comprising single reads can be used to assess large copy number variations using the algorithms described herein.

Target-Specific Sequencing

Target-specific sequencing is selective sequencing of specific genomic regions, specific genes, or whole exome sequencing. Non-limiting examples of the genomic regions include one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions, degenerate-mapping regions, or a combination thereof. The sets of genes or regulatory elements can be related to one or more specific genetic disorders of interest. The one or more polymorphisms can comprise one or more single nucleotide variations (SNVs), copy number variations (CNVs), insertions, deletions, structural variant junctions, variable length tandem repeats, or a combination thereof.

In some cases, the target-specific sequencing data can comprise sequencing data of some untargeted regions. One example of target-specific sequencing is whole exome sequencing. Whole exome sequencing is target-specific or selective sequencing of coding regions of the DNA genome. The targeted exome is usually the portion of the DNA that translates into proteins, namely exonic sequence. However, regions that do not translate into proteins, namely non-exonic sequences, may also be included within the sequence. Non-exonic sequences are usually not included in exome studies. In the human genome there can be about 180,000 exons; these can constitute about 1% of the human genome, which can translate to about 30 megabases (Mb) in length. It can be estimated that the protein coding regions of the human genome can harbor about 85% of the disease-causing mutations. The robust approach to sequencing the complete coding region (exome) can be clinically relevant in genetic diagnosis due to the current understanding of functional consequences in sequence variation, by identifying the functional variation that is responsible for both Mendelian and common diseases without the high costs associated with high coverage whole genome sequencing while maintaining high coverage in sequence depth. Other aspects of exome sequencing can be found in Ng SB et al., “Targeted capture and massively parallel sequencing of 12 human exomes,” Nature 461 (7261): 272-276 and Choi M et al., “Genetic diagnosis by whole exome capture and massively parallel DNA sequencing,” Proc Natl Acad Sci USA 106 (45): 19096-19101.

In some embodiments, the methods and systems as described herein comprise receiving, combining and/or analyzing target-specific sequencing (i.e., whole exome sequencing) data and untargeted sequencing data (i.e., low-coverage whole genome sequencing data). The methods and systems as described herein can assess copy number variation or other types of polymorphisms of the genome, and deep coverage of the functional consequences in sequence variation. In some embodiments, the target-specific sequencing (i.e., whole exome sequencing) data constitute about 1% of the human genome. In some embodiments, the target-specific sequencing (i.e., whole exome sequencing) data constitute about, at least about, or less than about 0.00001%, 0.00002%, 0.00003%, 0.00004%, 0.00005%, 0.00006%, 0.00007%, 0.00008%, 0.00009%, 0.0001%, 0.0002%, 0.0003%, 0.0004%, 0.0005%, 0.0006%, 0.0007%, 0.0008%, 0.0009%, 0.001%, 0.002%, 0.003%, 0.004%, 0.005%, 0.006%, 0.007%, 0.008%, 0.009%, 0.01%, 0.02%, 0.03%, 0.04%, 0.05%, 0.06%, 0.07%, 0.08%, 0.09%, 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%, 0.7%, 0.8%, 0.9%, 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, or 10% of the human genome.

In some cases, the target-specific sequencing data can comprise sequence data of about 180,000 exons. In some embodiments, the target-specific sequencing data can comprise sequence data of about, less than about, or at least about 1000, 5000, 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 45,000, 50,000, 55,000, 60,000, 65,000, 70,000, 75,000, 80,000, 85,000, 90,000, 95,000, 100,000, 105,000, 110,000, 115,000, 120,000, 125,000, 130,000, 135,000, 140,000, 145,000, 150,000, 155,000, 160,000, 165,000, 170,000, 175,000, or 180,000 exons. In some embodiments, the target-specific sequencing data can comprise sequence data of about, at least about, or less than about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99%, 99.5%, 99.9%, or 100% of the total exons in the whole genome.

In some embodiments, the target-specific sequencing data can comprise about 30 megabasepairs of sequences. In some embodiments, the target-specific sequencing data can comprise about, at least about, or less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 megabasepairs; or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, or 990 kilobasepairs.

The target-specific sequencing techniques used in the methods of the invention may generate at least 10, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 5,000, 10,000, 50,000, 100,000, 500,000, or 1,000,000 reads per run. Alternatively, the target-specific sequencing technique used in the methods of the invention generates at least 1,500,000, 2,000,000, 2,500,000, 3,000,000, 3,500,000, 4,000,000, 4,500,000, or 5,000,000 reads per run.

In some embodiments, the target-specific sequencing data may comprise sequence information of 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 15; 20; 25; 30; 35; 40; 45; 50; 60; 70; 80; 90; 100; 120; 140; 160; 180; 200; 240; 280; 300; 350; 400; 450; 500; 600; 700; 800; 1000; 1500; 2000; 2500; 3000; 3500; 4000; 5000; 7500; 10,000; 15,000; 20,000; 30,000; 45,000; 50,000; 60,000; 100,000; 200,000; 400,000; 600,000; 1 million; 1.5 million; 2 million or more genomic DNA markers or genes.

In some embodiments, the target-specific sequencing data (i.e., whole exome sequencing data or the protein coding regions of the human genome or disease-targeted specific sets of genes) can account for about 85% of the disease-causing mutations. In some embodiments, the target-specific sequencing data can comprise sequence information that can account for about, at least about, or less than about 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 100% of the disease-causing mutations. The disease-causing mutations can correspond to at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, or 100 different diseases or conditions.

In some cases, the non-exonic data from the exome sequencing data can cover a large portion of the whole genome and can be used to detect copy number variations and other polymorphisms in the genome. In some cases, the methods and systems as described herein can comprise annotating, by a computer or other means, the target-specific sequencing data (i.e., exome sequencing data) pertaining to a nucleic acid sample as targeted or untargeted (i.e., exonic or non-exonic). In some cases, the methods and systems as described herein can further comprise generating, by a computer or other means, an output comprising analysis of the untargeted data (i.e., non-exonic data). The untargeted data (i.e., non-exonic data) can be used, in some embodiments of the instant invention, to provide access to or detect copy number variations or other types of polymorphisms or genomic regions of the whole genome. The methods and systems as described herein can further comprise combining and/or analyzing the untargeted data annotated from the target-specific sequencing data (i.e., non-exonic data) and targeted data such as exonic data or other target-specific sequencing data to generate an output. The non-exonic data can have coverage of the genome similar to that of low coverage whole genome sequencing, and as a result, it can be used to detect copy number variations or other types of polymorphisms or genomic regions of the genome. The untargeted data such as non-exonic data can be annotated from the target-specific sequencing data such as whole exome sequencing data using any methods well known in the art, for example without limitation, the methods described in Guo Y et al., “Exome sequencing generates high quality data in non-target regions,” BMC Genomics. 2012 May 20; 13:194 and Asan et al., “Comprehensive comparison of three commercial human whole-exome capture platforms,” Genome Biol. 2011 Sep. 28; 12(9).
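The annotation of reads as targeted or untargeted can be pictured as an interval-overlap test against the capture targets. The following is a minimal sketch only, assuming half-open (start, end) coordinates on a single chromosome and a sorted, non-overlapping target list; the function name annotate_reads and the rule that any overlap counts as targeted are assumptions made for illustration and are not taken from the cited references.

```python
from bisect import bisect_right

def annotate_reads(read_intervals, target_intervals):
    """Label each read 'targeted' if it overlaps a target, else 'untargeted'.

    Intervals are half-open (start, end) pairs on one chromosome.
    target_intervals is assumed to be sorted and non-overlapping.
    """
    starts = [start for start, _ in target_intervals]
    labels = []
    for r_start, r_end in read_intervals:
        # Index of the last target whose start lies before the read's end.
        i = bisect_right(starts, r_end - 1) - 1
        overlaps = i >= 0 and target_intervals[i][1] > r_start
        labels.append("targeted" if overlaps else "untargeted")
    return labels

# Illustrative exon targets and reads (coordinates are made up).
targets = [(100, 200), (500, 600)]
reads = [(150, 250), (300, 400), (590, 690), (0, 100)]
labels = annotate_reads(reads, targets)
```

Reads labeled untargeted in this way could then be passed to the same bin-counting analysis used for low coverage whole genome data.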

In some cases, the untargeted data from the target-specific sequencing (i.e., non-exonic sequencing data) can comprise coverage of about or at least about 5, 4.5, 4, 3.5, 3, 2.5, 2, 1.5, or 1 gigabases; or about or at least about 900, 850, 800, 750, 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 150, 100, 90, 80, 70, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 5, 4, 3, 2, or 1 megabases. The untargeted data from the target-specific sequencing (i.e., non-exonic sequencing data) can comprise coverage of less than about 900, 850, 800, 750, 700, 650, 600, 550, 500, 450, 400, 350, 300, 250, 200, 150, 100, 90, 80, 70, 60, 55, 50, 45, 40, 35, 30, 25, 20, 15, or 10 megabases. The untargeted data from the target-specific sequencing (i.e., non-exonic sequencing data) can comprise between about 1-5, 0.5-10, 1-10, 0.5-5, 0.1-5, 0.1-2, or 0.1-1 gigabases.

In some cases, the untargeted data from the target-specific sequencing (i.e., non-exonic sequencing data) can comprise about 50% or more of the target-specific sequencing data such as whole exome sequencing data. In some cases, the untargeted data from the target-specific sequencing (i.e., non-exonic sequencing data) can comprise about 10% or more of the target-specific sequencing data such as whole exome sequencing data. In some embodiments, the untargeted data from the target-specific sequencing (i.e., non-exonic sequencing data) can comprise about 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the target-specific sequencing data such as whole exome sequencing data.

Output Data

Methods and systems as disclosed herein can comprise generating output data. The output data can comprise any sequencing information data that is generated by methods as described elsewhere in this specification. For example without limitation, the output can comprise whole genome sequencing data, whole exome sequencing data, other target-specific sequencing data, a subset of sequencing data, non-exonic data, exonic data, or a combination thereof. The one or more outputs, sets of outputs, combined outputs, and/or combined sets of outputs may comprise one or more biomedical reports, biomedical outputs, rare variant outputs, pharmacogenetic outputs, population study outputs, case-control outputs, biomedical databases, genomic databases, disease databases, or net content.

The methods and systems as described herein can comprise generating an output comprising detections of one or more genomic regions selected from copy number variations, single nucleotide variations, structural variants, one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions, degenerate-mapping regions, or a combination thereof.

The target-specific sequencing data may be based on genomic regions that comprise one or more polymorphisms, sets of genes, sets of regulatory elements, micro-deletions, homopolymers, simple tandem repeats, regions of high GC content, regions of low GC content, paralogous regions, degenerate-mapping regions, or a combination thereof. This sequencing may take the form of mutational analysis for one or more polymorphisms such as single nucleotide polymorphism (SNP) or single nucleotide variation (SNV) analysis, insertion deletion polymorphism (InDel) analysis, variable number of tandem repeat (VNTR) analysis, copy number variation (CNV) analysis (alternatively referred to as copy number polymorphism), partial or whole genome sequencing, or a combination thereof. The types of polymorphisms that can be analyzed can be any polymorphism that is known in the art. The types of polymorphism may be insertions, deletions, structural variant junctions, variable length tandem repeats, single nucleotide mutations, single nucleotide variations, copy number variations, or a combination thereof. In preferred embodiments, the polymorphism is a copy number variation or single nucleotide variant. Methods for performing genomic analyses are known in the art and may include high throughput sequencing such as but not limited to those methods described in U.S. Pat. Nos. 7,335,762; 7,323,305; 7,264,929; 7,244,559; 7,211,390; 7,361,488; 7,300,788; and 7,280,922, each of which is entirely incorporated herein by reference. Methods for performing genomic analyses may also include microarray methods as described hereinafter.

Algorithms and Models

In some cases, the methods and systems as described herein are used to generate an output comprising detection and/or quantitation of genomic DNA regions such as a region containing a DNA polymorphism. In some cases, the detection of the one or more genomic regions is based on one or more algorithms, depending on the source of data inputs or databases that are described elsewhere in the instant specification. Each of the one or more algorithms can be used to receive, combine and generate data comprising detection of genomic regions (i.e., polymorphisms). In some embodiments, the instant method and system can comprise detection of the genomic regions that is based on one or more, two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, or ten or more algorithms. The algorithms can be machine-learning algorithms, computer-implemented algorithms, machine-executed algorithms, automatic algorithms and the like.

The resulting data for each nucleic acid sample can be analyzed using feature selection techniques including filter techniques, which assess the relevance of features by examining the intrinsic properties of the data; wrapper methods, which embed the model hypothesis within a feature subset search; and embedded techniques, in which the search for an optimal set of features is built into an algorithm or model.

In some situations, an assay is used to generate a first set of data containing at least some or all variants (e.g., single nucleotide variations or insertion deletion polymorphisms) from the exome of a nucleic acid sample of a subject, and another assay is used to generate a second set of data that contains at least some or all variants (e.g., structural variants or copy number variations) from the whole genome of the nucleic acid sample of the subject. The first set of data and second set of data can be combined into a combined output. In some cases, variants from the first set of data and variants from the second set of data are combined into the combined output. The assays can be nucleic acid sequencing or nucleic acid amplification (e.g., PCR) with two different sample preparation methods that can be directed to identifying variants in the exome and variants in the genome. For example, variants in the exome can be identified using target-specific sequencing (e.g., using target-specific primers), and variants in the whole genome can be identified using untargeted sequencing.

In some cases, the detection of the one or more genomic regions is based on one or more statistical models. Statistical models or filtering techniques useful in the methods of the present invention include (1) parametric methods such as the use of two sample t-tests, ANOVA analyses, Bayesian frameworks, and Gamma distribution models; (2) model-free methods such as the use of Wilcoxon rank sum tests, between-within class sum of squares tests, rank products methods, random permutation methods, or TNoM, which involves setting a threshold point for fold-change differences in expression between two datasets and then detecting the threshold point in each gene that minimizes the number of misclassifications; and (3) multivariate methods such as bivariate methods, correlation based feature selection methods (CFS), minimum redundancy maximum relevance methods (MRMR), Markov blanket filter methods, Markov models, Hidden Markov Model (HMM), and uncorrelated shrunken centroid methods. In some cases, the Hidden Markov Model (HMM) is given an internal state, wherein the internal state is set according to an overall copy number of a chromosome in the first or second nucleic acid sample. In an instance, for a diploid chromosome, the HMM's internal states can be homozygous deletion (locally zero copies), heterozygous deletion (locally one copy), normal (locally two copies), duplication (more than two copies), and reference gap (present as a state to distinguish gaps from homozygous deletions). In another instance, for a haploid chromosome (e.g., X or Y in a male), the HMM's internal states can be deletion (locally zero copies), normal (locally one copy), duplication (more than one copy), and reference gap (present as a state to distinguish gaps from deletions). For example, for a haploid chromosome, there may be no heterozygous deletion state available. In another instance, for trisomic and/or tetrasomic chromosomes, the HMM may have an additional intermediate state, wherein the intermediate state can account for the various CNV possibilities. In another embodiment, the Hidden Markov Model is used to filter the output by examination of measured insert-sizes of reads near a detected feature's breakpoint(s). Other models or algorithms useful in the methods of the present invention include sequential search methods, genetic algorithms, estimation of distribution algorithms, random forest algorithms, weight vector of support vector machine algorithms, weights of logistic regression algorithms, and the like. Bioinformatics. 2007 Oct. 1; 23(19):2507-17 provides an overview of the relative merits of the algorithms or models provided above for the analysis of data. Illustrative algorithms include but are not limited to methods that reduce the number of variables such as principal component analysis algorithms, partial least squares methods, and independent component analysis algorithms; methods that handle large numbers of variables directly such as statistical methods; and methods based on machine learning techniques. Statistical methods include penalized logistic regression, prediction analysis of microarrays (PAM), methods based on shrunken centroids, support vector machine analysis, and regularized linear discriminant analysis. Machine learning techniques include bagging procedures, boosting procedures, random forest algorithms, and combinations thereof. Cancer Inform. 2008; 6: 77-97 provides an overview of the techniques provided above for the analysis of data.
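To make the copy-number states of the diploid HMM concrete, the following is a minimal Viterbi sketch over per-bin read counts. It is an illustration under stated assumptions and not the model of the present disclosure: the Poisson emission model, the uniform switching probability, the stay_prob value, the cap of duplication at three copies, and the omission of the reference gap state are all assumptions, and the function name viterbi_copy_number is hypothetical.

```python
import math

def viterbi_copy_number(counts, expected_per_bin, stay_prob=0.99):
    """Most likely copy-number path (0, 1, 2, or 3 copies) for a diploid chromosome.

    counts: non-empty list of read counts per genomic bin.
    expected_per_bin: expected read count for a normal (two-copy) bin.
    stay_prob: assumed probability of remaining in the same state between bins.
    """
    copies = [0, 1, 2, 3]
    # Assumed Poisson mean per state; a small floor keeps the zero-copy mean positive.
    means = [max(expected_per_bin * c / 2.0, 0.01) for c in copies]
    n = len(copies)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n - 1))

    def log_poisson(k, mean):
        return k * math.log(mean) - mean - math.lgamma(k + 1)

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    # Uniform prior over states for the first bin.
    score = [math.log(1.0 / n) + log_poisson(counts[0], m) for m in means]
    back = []
    for k in counts[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans(i, j))
            pointers.append(best)
            score.append(prev[best] + log_trans(best, j) + log_poisson(k, means[j]))
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [copies[s] for s in path]

# Illustrative counts: three bins at roughly half the normal depth.
calls = viterbi_copy_number([100, 98, 103, 52, 49, 47, 101, 99], expected_per_bin=100)
```

For a haploid chromosome, the same sketch could be adapted by dropping the intermediate deletion state and scaling the means to a one-copy baseline.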

In some embodiments, an HMM-based detection algorithm can “segmentally” detect a large or substantially large CNV. In some cases, due to fluctuations in the coverage signal, there may be small detection gaps along the length of the true CNV. In an example, a 1 megabasepair (Mbp) deletion may be detected as a small number of separate nominal detections, with small gaps between them. To mitigate this, a merge operation can be employed that identifies pairs of adjacent detections which are separated by a gap that is smaller than either of the two bracketing detections. The merge operation then measures the median coverage level in the gap. If the median coverage passes a predefined threshold, then the two detections are merged into a single large detection that spans the two original detections (including the enclosed detection gap). In such an example, the true feature spans both detections, and the gap is a statistical artifact. Using real sequencing data of samples that are known to have large CNVs, this merge operation can permit substantially better fidelity with respect to the true properties of the CNVs.
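The merge operation described above can be sketched as follows. This is a minimal illustration, assuming detections are expressed as half-open intervals of bin indices and coverage as a per-bin list. Because the text does not state the direction of the threshold test, it is passed in as a caller-supplied predicate; the names merge_detections and gap_passes are hypothetical.

```python
from statistics import median

def merge_detections(detections, coverage, gap_passes):
    """Merge adjacent detections separated by a small, consistent gap.

    detections: sorted, non-overlapping half-open (start, end) bin-index intervals.
    coverage: per-bin coverage values.
    gap_passes: callable taking the median gap coverage and returning True
        when the gap should be absorbed (threshold semantics are an assumption).
    """
    merged = []
    for start, end in detections:
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            smaller = min(prev_end - prev_start, end - start)
            # The gap must be smaller than either bracketing detection.
            if 0 < gap < smaller and gap_passes(median(coverage[prev_end:start])):
                merged[-1] = (prev_start, end)
                continue
        merged.append((start, end))
    return merged

# Illustrative normalized coverage: a deletion interrupted by a one-bin gap,
# followed far downstream by a short, unrelated detection.
coverage = [1.0] * 5 + [0.5] * 4 + [0.7] + [0.5] * 5 + [1.0] * 10 + [0.5] * 2 + [1.0] * 3
detections = [(5, 9), (10, 15), (25, 27)]
merged = merge_detections(detections, coverage, gap_passes=lambda m: m < 0.8)
```

In this sketch a merged detection is compared against the next detection at its merged length, so a chain of fragments can be absorbed in one pass; that greedy behavior is a design choice of the example rather than something specified in the text.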

In some embodiments, the untargeted sequencing data (e.g., whole genome sequencing data) comprises paired-end reads. In some embodiments, the untargeted sequencing data (e.g., whole genome sequencing data) comprises mate-pair reads. The paired-end reads and/or the mate-pair reads may comprise insert-sizes greater than or equal to about 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, or 100.0 kilobasepairs. In some embodiments, when paired-end reads and/or mate-pair reads are used, DNA or RNA molecules are digested into small inserts. The inserts may be ligated with adaptors on both ends. An insert-size may be the size of the DNA or RNA inserted between the adaptors. The adaptors may enable amplification and sequencing of the DNA or RNA. In an example, the paired-end reads may have an insert-size of larger than about 1 kilobasepair. In another example, the mate-pair reads may have an insert-size of larger than about 2 kilobasepairs.

The paired-end reads and/or mate-pair reads may sequence 100 basepair (bp) reads on both ends of the insert. For example, when untargeted sequencing data (e.g., whole genome sequencing data) is comprised of either paired-end reads with an insert size of larger than 1 kilobasepair (kbp) or mate pairs with a separation of larger than 2 kbp, then even with low (~1x) read depth, there may be at least about 10 molecules spanning any particular position on the genome, which may be sufficient to provide corroborating evidence of large CNVs detected by methods and systems presented herein. In another example, methods and systems of the present disclosure, such as the HMM method, can be used to detect a 50 kbp heterozygous deletion. When paired-end reads are used, about half of the paired-end molecules that span the detection breakpoints may have insert-sizes that are 50 kbp larger than normal. The insert-sizes may appear larger because they may be mapped to a reference with the sequence segment that is deleted in the sample. In some examples, when 1 kbp insert sizes are used, then 10 reads span the breakpoint, 5 of which can have an anomalous insert size. Five reads may provide sufficient statistical power and information to generate accurate and/or reliable results. In some cases, if an insert-size of only about 300 basepairs is used, then there may be only about 3 reads that span the breakpoint on average, and half of that may be about 1 read, which may not provide sufficient statistical power and information to generate accurate and/or reliable results.
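The figures in the preceding paragraph can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes that the expected number of molecules spanning a position is approximately the read depth multiplied by the insert size and divided by the read length, with depth counted from one read length per molecule; this convention is an assumption chosen because it matches the numbers given above (about 10 for a 1 kbp insert and about 3 for a 300 bp insert at ~1x depth with 100 bp reads). The function names are hypothetical.

```python
def expected_spanning_molecules(read_depth, insert_size_bp, read_length_bp):
    """Approximate number of sequenced molecules whose insert spans a given position.

    Assumes depth is counted from one read length per molecule (an assumption
    made so the result matches the worked figures in the text).
    """
    return read_depth * insert_size_bp / read_length_bp

def expected_anomalous_molecules(read_depth, insert_size_bp, read_length_bp):
    """For a heterozygous deletion, about half of the spanning molecules
    come from the deleted allele and show an anomalous insert size."""
    return 0.5 * expected_spanning_molecules(read_depth, insert_size_bp, read_length_bp)

spanning_1kbp = expected_spanning_molecules(1.0, 1000, 100)    # 1 kbp inserts
spanning_300bp = expected_spanning_molecules(1.0, 300, 100)    # 300 bp inserts
anomalous_1kbp = expected_anomalous_molecules(1.0, 1000, 100)
anomalous_300bp = expected_anomalous_molecules(1.0, 300, 100)
```

Under these assumptions the supporting evidence at a breakpoint scales linearly with insert size, which is why longer inserts or mate pairs can corroborate large CNVs even at low read depth.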

Methods and systems provided herein may further include the use of a feature selection algorithm as provided herein. In some embodiments of the present invention, feature selection is provided by use of the LIMMA software package (Smyth, G. K. (2005). Limma: linear models for microarray data. In: Bioinformatics and Computational Biology Solutions using R and Bioconductor, R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds.), Springer, New York, pages 397-420).

In some embodiments of the present invention, a diagonal linear discriminant analysis, k-nearest neighbor algorithm, support vector machine (SVM) algorithm, linear support vector machine, random forest algorithm, or a probabilistic model-based method or a combination thereof is provided for the detection of one or more genomic regions. In some embodiments, identified markers that distinguish samples (e.g., diseased versus normal) or distinguish genomic regions (e.g., copy number variation versus normal) are selected based on statistical significance of the difference in expression levels between classes of interest. In some cases, the statistical significance is adjusted by applying a Benjamini-Hochberg or another correction for false discovery rate (FDR).
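As an illustration of the false discovery rate adjustment mentioned above, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure applied to a list of p-values. The function name benjamini_hochberg and the example p-values are hypothetical; how the p-values are obtained is outside the scope of the sketch.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative p-values for four candidate markers.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q <= 0.05 for q in adjusted]
```

Markers whose adjusted value falls at or below the chosen FDR level (0.05 in this example) would be retained.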

In some cases, the algorithm may be supplemented with a meta-analysis approach such as that described by Fishel and Kaufman et al. 2007 Bioinformatics 23(13): 1599-606. In some cases, the algorithm may be supplemented with a meta-analysis approach such as a repeatability analysis. In some cases, the repeatability analysis selects markers that appear in at least one predictive expression product marker set.

A statistical evaluation of the detection of the genomic regions may provide a quantitative value or values indicative of one or more of the following: the likelihood of diagnostic accuracy; the likelihood of disorder, disease, condition and the like; the likelihood of a particular disorder, disease or condition; and the likelihood of the success of a particular therapeutic intervention. Thus, a physician, who is not likely to be trained in genetics or molecular biology, need not understand the raw data. Rather, the data is presented directly to the physician in the form of the quantitative values to guide patient care. The results can be statistically evaluated using a number of methods known in the art including, but not limited to: Student's t-test, the two-sided t-test, Pearson rank sum analysis, Hidden Markov Model analysis, analysis of q-q plots, principal component analysis, one-way ANOVA, two-way ANOVA, LIMMA, and the like.

Polymorphisms

A polymorphism can include the occurrence of two or more genetically determined alternative sequences or alleles in a population. A polymorphic marker or site is the locus at which divergence occurs. Preferred markers have at least two alleles, each occurring at a frequency of preferably greater than 1%, and more preferably greater than 10% or 20% of a selected population. A polymorphism may comprise one or more base changes, an insertion, a repeat, or a deletion. A polymorphic locus may be as small as one base pair. Polymorphic markers include single nucleotide polymorphisms (SNPs) or single nucleotide variations, copy number variations (CNVs), restriction fragment length polymorphisms (RFLPs), short tandem repeats (STRs), variable number of tandem repeats (VNTRs), hypervariable regions, minisatellites, dinucleotide repeats, trinucleotide repeats, tetranucleotide repeats, simple sequence repeats, and insertion elements such as Alu. A polymorphism between two nucleic acids can occur naturally, or be caused by exposure to or contact with chemicals, enzymes, or other agents, or exposure to agents that cause damage to nucleic acids, for example, ultraviolet radiation, mutagens or carcinogens.

In some embodiments, polymorphisms are determined for one or more genes involved in one or more of different metabolic or signaling pathways. In some cases, the methods of the present invention provide for analysis of polymorphisms of at least one gene of 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100 or more different metabolic or signaling pathways.

Methods and systems described herein can be used to discriminate and quantitate a genomic region containing a polymorphism. The methods described herein can discriminate and quantitate at least 1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000, 2,000,000, 3,000,000 or more polymorphisms originating from one or more samples. In some embodiments, the methods described herein can discriminate and quantitate at least 1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000, 2,000,000, 3,000,000, or more different polymorphic markers originating from one or more samples. In some embodiments, the methods described herein can discriminate and quantitate at least 1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000, 2,000,000, 3,000,000, or more different SNPs originating from one or more samples.

In some embodiments, the methods described herein are used to detect and/or quantify genomic regions by mapping the region to the genome of a species. In some embodiments, the methods described herein can discriminate and quantitate a genomic region from a species. The methods described herein can discriminate and quantitate at least 1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 1,000, 2,000, 5,000, 10,000, 20,000, 50,000, 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, 1,000,000, 2,000,000, 3,000,000, or more genomic regions from a species.

Methods and systems described herein may comprise the detection of genetic variants. In some instances, at least about 2, 3, 4, 5, 10, 15, 20, 30, 40, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, or 1000 genetic variants are detected in a single reaction. In another example, at least about 2000, 5000, 10000, 15000, 20000, 30000, 40000, 50000, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, or 1000000 genetic variants are detected in a single reaction. De novo assembly comprising an N50 value or median of 50, 75, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 kilobasepairs or more can be achieved using the methods and systems described herein. Genetic variants within these assembled sequences can be identified according to the methods of the invention.
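For readers unfamiliar with the N50 statistic referenced above, the following minimal sketch computes it under the common definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name n50 and the example lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (same units as input)."""
    total = sum(contig_lengths)
    running = 0
    # Accumulate contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must be non-empty")

# Illustrative contig lengths in kilobasepairs.
assembly_n50 = n50([80, 70, 50, 40, 30, 20])
```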

In some cases, genomic analysis such as sequencing may be performed in combination with any of the other methods described herein. For example, a sample may be obtained, tested for adequacy, and divided into aliquots or subsets of nucleic acid sample. One or more subsets of nucleic acid sample may then be used for target-specific sequencing of the present invention, and one or more may be used for low coverage whole genome sequencing methods of the present invention. It is further understood that the present invention anticipates that one skilled in the art may wish to perform other analyses on the biological sample that are not explicitly provided herein.

Methods of Sequencing

The methods and systems as disclosed herein may comprise, or comprise the use of, data from one or more sequencing reactions on one or more nucleic acid molecules in a sample. The methods and systems disclosed herein may comprise data from 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 30 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, 100 or more, 200 or more, 300 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, or 1000 or more sequencing reactions on one or more nucleic acid molecules in a sample. The sequencing reactions may be run simultaneously, sequentially, or a combination thereof. The sequencing reactions may comprise whole genome sequencing or whole exome sequencing or one or more additional sequencing methods. The sequencing reactions may comprise Maxam-Gilbert, chain-termination or high-throughput systems.

The methods and systems disclosed herein may comprise data from at least one long read sequencing reaction and at least one short read sequencing reaction. The long read sequencing reaction and/or short read sequencing reaction data may comprise at least a portion of a subset of nucleic acid molecules. The long read sequencing reaction and/or short read sequencing reaction data may comprise at least a portion of two or more subsets of nucleic acid molecules. Data from both a long read sequencing reaction and a short read sequencing reaction may comprise at least a portion of one or more subsets of nucleic acid molecules.

Sequencing of the one or more nucleic acid molecules or subsets thereof may comprise at least about 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 15; 20; 25; 30; 35; 40; 45; 50; 60; 70; 80; 90; 100; 200; 300; 400; 500; 600; 700; 800; 900; 1,000; 1500; 2,000; 2500; 3,000; 3500; 4,000; 4500; 5,000; 5500; 6,000; 6500; 7,000; 7500; 8,000; 8500; 9,000; 10,000; 25,000; 50,000; 75,000; 100,000; 250,000; 500,000; 750,000; 10,000,000; 25,000,000; 50,000,000; 100,000,000; 250,000,000; 500,000,000; 750,000,000; 1,000,000,000 or more sequencing reads.

Sequencing data may comprise at least about 50; 60; 70; 80; 90; 100; 110; 120; 130; 140; 150; 160; 170; 180; 190; 200; 210; 220; 230; 240; 250; 260; 270; 280; 290; 300; 325; 350; 375; 400; 425; 450; 475; 500; 600; 700; 800; 900; 1,000; 1500; 2,000; 2500; 3,000; 3500; 4,000; 4500; 5,000; 5500; 6,000; 6500; 7,000; 7500; 8,000; 8500; 9,000; 10,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; 100,000 or more bases or basepairs of one or more nucleic acid molecules. Sequencing data may comprise at least about 50; 60; 70; 80; 90; 100; 110; 120; 130; 140; 150; 160; 170; 180; 190; 200; 210; 220; 230; 240; 250; 260; 270; 280; 290; 300; 325; 350; 375; 400; 425; 450; 475; 500; 600; 700; 800; 900; 1,000; 1500; 2,000; 2500; 3,000; 3500; 4,000; 4500; 5,000; 5500; 6,000; 6500; 7,000; 7500; 8,000; 8500; 9,000; 10,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; 100,000 or more consecutive bases or basepairs of one or more nucleic acid molecules.

Preferably, the sequencing data in the methods and systems of the invention can comprise at least about 30 basepairs, at least about 40 basepairs, at least about 50 basepairs, at least about 60 basepairs, at least about 70 basepairs, at least about 80 basepairs, at least about 90 basepairs, at least about 100 basepairs, at least about 110 basepairs, at least about 120 basepairs, at least about 150 basepairs, at least about 200 basepairs, at least about 250 basepairs, at least about 300 basepairs, at least about 350 basepairs, at least about 400 basepairs, at least about 450 basepairs, at least about 500 basepairs, at least about 550 basepairs, at least about 600 basepairs, at least about 700 basepairs, at least about 800 basepairs, at least about 900 basepairs, or at least about 1,000 basepairs per read. Alternatively, the sequencing data in the methods and systems of the invention can comprise long sequencing reads. In some instances, the sequencing data used in the methods and systems of the invention can comprise at least about 1,200 basepairs per read, at least about 1,500 basepairs per read, at least about 1,800 basepairs per read, at least about 2,000 basepairs per read, at least about 2,500 basepairs per read, at least about 3,000 basepairs per read, at least about 3,500 basepairs per read, at least about 4,000 basepairs per read, at least about 4,500 basepairs per read, at least about 5,000 basepairs per read, at least about 6,000 basepairs per read, at least about 7,000 basepairs per read, at least about 8,000 basepairs per read, at least about 9,000 basepairs per read, at least about 10,000 basepairs per read, 20,000 basepairs per read, 30,000 basepairs per read, 40,000 basepairs per read, 50,000 basepairs per read, 60,000 basepairs per read, 70,000 basepairs per read, 80,000 basepairs per read, 90,000 basepairs per read, or 100,000 basepairs per read.
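
By way of non-limiting illustration, the read counts and read lengths recited above can be related to an expected mean depth of coverage using the standard approximation that mean coverage equals the number of reads multiplied by the read length, divided by the size of the sequenced region. The sketch below assumes uniformly distributed, fully mappable reads; the function name `mean_coverage` and the example values (including the approximate human genome size) are illustrative assumptions and are not taken from the disclosure.

```python
def mean_coverage(num_reads, read_length_bp, region_size_bp):
    """Approximate mean depth of coverage over a sequenced region.

    Uses coverage = (number of reads * read length) / region size and
    assumes every read maps uniquely and uniformly within the region.
    """
    if region_size_bp <= 0:
        raise ValueError("region size must be positive")
    if num_reads < 0 or read_length_bp < 0:
        raise ValueError("read count and read length must be non-negative")
    return num_reads * read_length_bp / region_size_bp


# Hypothetical low-coverage whole genome run: 100 million reads of
# 100 basepairs against an assumed genome size of about 3.1 billion bases.
depth = mean_coverage(100_000_000, 100, 3_100_000_000)
print(f"approximate mean coverage: {depth:.1f}x")
```

The same relation applies to a targeted panel by substituting the total size of the targeted regions for the genome size, which is why a comparable number of reads yields far deeper coverage in target-specific sequencing.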

High-throughput sequencing systems may allow detection of a sequenced nucleotide immediately after or upon its incorporation into a growing strand, i.e., detection of sequence in real time or substantially real time. In some cases, high throughput sequencing generates at least 1,000, at least 5,000, at least 10,000, at least 20,000, at least 30,000, at least 40,000, at least 50,000, at least 100,000 or at least 500,000 sequence reads per hour; with each read being at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 120, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, at least 450, or at least 500 bases per read. Sequencing can be performed using the nucleic acids described herein, such as genomic DNA, cDNA derived from RNA transcripts or RNA as a template.

The methods and systems as described in the current invention may also comprise comprehensive and targeted RNA expression detection data. For example, the invention provides for data via whole transcriptome sequencing or amplification. Whole transcriptome sequencing or amplification allows one to determine the expression of all RNA molecules comprising messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and non-coding RNA. Targeted RNA sequencing or amplification captures sequences of RNA from a relevant subset of a transcriptome in order to view high interest genes.

Sequencing may be conducted by any method known in the art. Targeted sequencing methods can include enrichment or amplification of a region of interest in the nucleic acid such as one or more polymorphisms, sets of genes, or genomic regions prior to sequencing. The enrichment or the targeted sequencing can comprise the use of one or more non-random or specific primers to amplify, enrich, or detect the region of interest. The non-random primers can be any non-random primers that are known in the art. For example, without limitation, primers can target one or more genes, exons, untranslated regions, specific genomic regions or a combination thereof.

Untargeted sequencing methods can include the use of one or more random or non-specific primers to enrich or amplify sequences non-specifically. Random primers can be any random primers that are known in the art. A universal primer is an example of a random primer. Non-target specific primers can be an example of a random primer. Non-limiting examples of random primers include random hexamers, or hexamers and anchored-dT primers. Enrichment or amplification using random primers prior to sequencing allows amplification of non-specific sequences that evenly cover the entire sequence in a nucleic acid sample.

DNA sequencing techniques include classic dideoxy sequencing reactions (Sanger method) using labeled terminators or primers and gel separation in slab or capillary, sequencing by synthesis using reversibly terminated labeled nucleotides, pyrosequencing, 454 sequencing, allele specific hybridization to a library of labeled oligonucleotide probes, sequencing by synthesis using allele specific hybridization to a library of labeled clones that is followed by ligation, real time monitoring of the incorporation of labeled nucleotides during a polymerization step, polony sequencing, and SOLiD sequencing. Sequencing of separated molecules has more recently been demonstrated by sequential or single extension reactions using polymerases or ligases as well as by single or sequential differential hybridizations with libraries of probes.

A sequencing technique that can be used with methods of the present disclosure includes, for example, Helicos True Single Molecule Sequencing (tSMS) (Harris T. D. et al. (2008) Science 320:106-109). In the tSMS technique, a DNA sample is cleaved into template fragments of approximately 100 to 200 nucleotides in length, and a poly(A) sequence is subsequently added to the 3′ end of each DNA template fragment. Each template fragment is labeled by the addition of a fluorescently labeled adenosine nucleotide. The DNA strands are then immobilized onto the surface of a flow cell, through hybridization with oligo-dT capture probes that are immobilized at specific sites on the surface of the flow cell. The sites can be at a density of about 100 million sites/cm². The flow cell is then loaded into an instrument, e.g., a HeliScope™ sequencer, and a laser illuminates the surface of the flow cell, revealing the position of each DNA template fragment. A CCD camera can map the position of the template fragments on the flow cell surface. The fluorescent label of each DNA template fragment is then cleaved and washed away.

The sequencing reaction begins by introducing a DNA polymerase and a fluorescently labeled nucleotide. The oligo-dT nucleic acid serves as a primer, and the polymerase incorporates the labeled nucleotides to the primer in a template directed manner. The polymerase and unincorporated nucleotides are then removed. The templates that have directed incorporation of the fluorescently labeled nucleotide are detected by imaging the flow cell surface. After imaging, a cleavage step removes the fluorescent label, and the process is repeated with other fluorescently labeled nucleotides until the desired read length is achieved. Sequence information is collected with each nucleotide addition step. Further description of tSMS is shown for example in Lapidus et al. (U.S. Pat. No. 7,169,560), Lapidus et al. (U.S. patent application number 2009/0191565), Quake et al. (U.S. Pat. No. 6,818,395), Harris (U.S. Pat. No. 7,282,337), Quake et al. (U.S. patent application number 2002/0164629), and Braslavsky et al., PNAS (USA), 100: 3960-3964 (2003). The contents of each of these references are incorporated by reference herein in their entirety.

An RNA sequence can also be detected by single molecule sequencing such as with the Helicos Direct RNA sequencing method, described in Fatih Ozsolak, et al., Direct RNA sequencing. Nature 461, 814-818. Total RNA or RNA fragments with natural poly(A) tails are introduced to poly(dT) coated flow cells in order to enable capture and sequencing of poly(A) RNA species. In situations where the RNA does not have a poly(A) tail, for example small sample species, a poly(A) polymerase is introduced to the RNA in order to generate a poly(A) tail so that the sample RNA may attach to the flow cells to enable capture and sequencing.

Another example of a DNA and RNA sequencing technique that can be used with methods of the present disclosure is 454 sequencing (Roche) (Margulies, M et al. 2005, Nature, 437, 376-380). 454 sequencing is a sequencing-by-synthesis technology that utilizes pyrosequencing. 454 sequencing of DNA involves two steps. In the first step, DNA is sheared into fragments of approximately 300-800 basepairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads using, e.g., Adaptor B, which contains a 5′-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion, resulting in multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (pico-liter sized). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi) which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5′ phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is detected and analyzed. In another embodiment, pyrosequencing is used to measure gene expression. Pyrosequencing of RNA may be conducted similarly to pyrosequencing of DNA, and is accomplished by attaching partial rRNA gene sequences to microscopic beads and then placing the attachments into individual wells. The attached partial rRNA sequences are then amplified in order to determine the gene expression profile, as described in Sharon Marsh, Pyrosequencing® Protocols, in Methods in Molecular Biology, Vol. 373, 15-23 (2007).
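
By way of non-limiting illustration, because the light signal in pyrosequencing is proportional to the number of nucleotides incorporated in a given nucleotide flow, a series of normalized flow signals can be converted into a base sequence by rounding each signal to the nearest whole number of incorporations. The sketch below is a simplified model only: the cyclic flow order "TACG", the normalization of a single incorporation to a signal of about 1.0, and the function name `decode_flowgram` are assumptions made for illustration and are not taken from the disclosure or from any instrument software.

```python
def decode_flowgram(flow_order, signals):
    """Convert normalized pyrosequencing flow signals into a base string.

    flow_order: nucleotides dispensed cyclically, e.g. "TACG" (assumed).
    signals: one normalized intensity per flow, where a value near 1.0
             is assumed to correspond to a single incorporation.
    """
    if not flow_order:
        raise ValueError("flow_order must contain at least one nucleotide")
    bases = []
    for index, signal in enumerate(signals):
        nucleotide = flow_order[index % len(flow_order)]
        # Round half up to the nearest homopolymer length; negative noise
        # is treated as zero incorporations.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical signals for two cycles of the assumed T, A, C, G flow order.
example_signals = [1.02, 0.05, 2.10, 0.97, 0.00, 1.10, 0.00, 3.05]
print(decode_flowgram("TACG", example_signals))
```

Simple rounding becomes unreliable for long homopolymers, where adjacent signal levels overlap; practical base callers use more elaborate signal models.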

Another example of DNA and RNA detection techniques that may be used with methods of the present disclosure is SOLiD technology (Applied Biosystems). SOLiD technology is a ligation based sequencing technology that may be utilized to run massively parallel next generation sequencing of both DNA and RNA. In DNA SOLiD sequencing, genomic DNA is sheared into fragments, and adaptors are attached to the 5′ and 3′ ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5′ and 3′ ends of the fragments, circularizing the fragments, digesting the circularized fragment to generate an internal adaptor, and attaching adaptors to the 5′ and 3′ ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, template, and PCR components. Following PCR, the templates are denatured and beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3′ modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated.

In other embodiments, SOLiD Serial Analysis of Gene Expression (SAGE) is used to measure gene expression. Serial analysis of gene expression (SAGE) is a method that allows the simultaneous and quantitative analysis of a large number of gene transcripts, without the need of providing an individual hybridization probe for each transcript. First, a short sequence tag (about 10-14 bp) is generated that contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript. Then, many transcripts are linked together to form long serial molecules that can be sequenced, revealing the sequence of the multiple transcripts simultaneously. The expression pattern of any population of transcripts can be quantitatively evaluated by determining the abundance of individual tags, and identifying the gene corresponding to each tag. For more details see, for example, Velculescu et al., Science 270:484-487 (1995); and Velculescu et al., Cell 88:243-51 (1997), the contents of each of which are incorporated by reference herein in their entirety.
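
By way of non-limiting illustration, the quantitative evaluation described above (determining the abundance of individual tags and identifying the gene corresponding to each tag) can be sketched as a counting step followed by a lookup. The tag sequences, gene names, the lookup table `tag_to_gene`, and the function name `tag_abundance` are hypothetical and are provided only to make the sketch self-contained.

```python
from collections import Counter


def tag_abundance(tags, tag_to_gene):
    """Return the fraction of observed tags assigned to each gene.

    tags: iterable of short sequence tags recovered from sequencing.
    tag_to_gene: mapping from tag sequence to gene name (hypothetical);
                 tags absent from the mapping are reported as "unassigned".
    """
    counts = Counter(tag_to_gene.get(tag, "unassigned") for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: count / total for gene, count in counts.items()}


# Hypothetical 10 bp tags and tag-to-gene assignments.
lookup = {"ACGTACGTAC": "geneA", "TTGCATGCAA": "geneB"}
observed = ["ACGTACGTAC"] * 3 + ["TTGCATGCAA"] + ["GGGGGGGGGG"]
print(tag_abundance(observed, lookup))
```

Relative tag fractions of this kind are comparable between samples only when the tags are drawn from the same position within each transcript, as noted above.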

Other examples of nucleic acid (e.g., DNA or RNA) sequencing techniques that may be used with the methods of the present disclosure are provided in U.S. Patent Publication Nos. 2009/0026082, 2009/0127589, 2010/0035252, 2010/0137143, 2010/0188073, 2010/0197507, 2010/0282617, 2010/0300559, 2010/0300895, 2010/0301398, and 2010/0304982, the contents of each of which is incorporated by reference herein in its entirety. In an example, DNA is sheared into fragments of approximately 300-800 basepairs, and the fragments are blunt ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to a surface and are attached at a resolution such that the fragments are individually resolvable. Addition of one or more nucleotides releases a proton (H⁺), and the signal is detected and recorded by a sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated.

Another example of a sequencing technology that can be used with methods of the present disclosure is Illumina sequencing, which is a polymerase-based sequencing-by-synthesis method that may be utilized to amplify DNA or RNA. Illumina sequencing for DNA is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Genomic DNA is fragmented, and adapters are added to the 5′ and 3′ ends of the fragments. DNA fragments that are attached to the surface of flow cell channels are extended and bridge amplified. The fragments become double stranded, and the double stranded molecules are denatured. Multiple cycles of the solid-phase amplification followed by denaturation can create several million clusters of approximately 1,000 single-stranded DNA copies of the same template in each channel of the flow cell. Primers, DNA polymerase, and four fluorophore-labeled, reversibly terminating nucleotides are used to perform sequential sequencing. After nucleotide incorporation, a laser is used to excite the fluorophores, and an image is captured and the identity of the first base is recorded. The 3′ terminators and fluorophores from each incorporated base are removed, and the incorporation, detection, and identification steps are repeated. When using Illumina sequencing to detect RNA, the same method applies except that RNA fragments are being isolated and amplified in order to determine the RNA expression profile of the sample. In some embodiments, high-throughput sequencing involves the use of technology available by Illumina's Genome Analyzer IIX, MiSeq personal sequencer, or HiSeq systems, such as those using HiSeq 2500, HiSeq 1500, HiSeq 2000, or HiSeq 1000. These machines use reversible terminator-based sequencing by synthesis chemistry. These machines can perform 200 billion or more DNA reads in eight days. Smaller systems may be utilized for performing runs within 3, 2, 1 days or less time. Short synthesis cycles may be used to minimize the time it takes to obtain sequencing results.

Another example of a sequencing technology that may be used with methods of the present disclosure includes the single molecule, real-time (SMRT) technology of Pacific Biosciences to sequence both DNA and RNA. In SMRT, each of the four DNA bases is attached to one of four different fluorescent dyes. These dyes are phospholinked. A single DNA polymerase is immobilized with a single molecule of template single stranded DNA at the bottom of a zero-mode waveguide (ZMW). A ZMW is a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (in microseconds). It takes several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is cleaved off. Detection of the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated. In order to sequence RNA, the DNA polymerase is replaced with a reverse transcriptase in the ZMW, and the process is followed accordingly.

Another example of a sequencing technique that can be used with methods of the present disclosure is nanopore sequencing (Soni G V and Meller A, Clin Chem 53: 1996-2001 (2007)). A nanopore is a small hole, of the order of 1 nanometer in diameter. Immersion of a nanopore in a conducting fluid, and application of a potential across it, results in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows is sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore may be used to determine the DNA sequence.

The next generation sequencing can comprise DNA nanoball sequencing (as performed, e.g., by Complete Genomics; see e.g., Drmanac et al. (2010) Science 327: 78-81). DNA can be isolated, fragmented, and size selected. For example, DNA can be fragmented (e.g., by sonication) to a mean length of about 500 basepairs. Adaptors (Ad1) can be attached to the ends of the fragments. The adaptors can be used to hybridize to anchors for sequencing reactions. DNA with adaptors bound to each end can be PCR amplified. The adaptor sequences can be modified so that complementary single strand ends bind to each other forming circular DNA. The DNA can be methylated to protect it from cleavage by a type IIS restriction enzyme used in a subsequent step. An adaptor (e.g., the right adaptor) can have a restriction recognition site, and the restriction recognition site can remain non-methylated. The non-methylated restriction recognition site in the adaptor can be recognized by a restriction enzyme (e.g., AcuI), and the DNA can be cleaved by AcuI 13 bp to the right of the right adaptor to form linear double stranded DNA. A second round of right and left adaptors (Ad2) can be ligated onto either end of the linear DNA, and all DNA with both adaptors bound can be amplified (e.g., by PCR). Ad2 sequences can be modified to allow them to bind each other and form circular DNA. The DNA can be methylated, but a restriction enzyme recognition site can remain non-methylated on the left Ad1 adaptor. A restriction enzyme (e.g., AcuI) can be applied, and the DNA can be cleaved 13 bp to the left of the Ad1 to form a linear DNA fragment. A third round of right and left adaptors (Ad3) can be ligated to the right and left flank of the linear DNA, and the resulting fragment can be PCR amplified. The adaptors can be modified so that they can bind to each other and form circular DNA. A type III restriction enzyme (e.g., EcoP15) can be added; EcoP15 can cleave the DNA 26 bp to the left of Ad3 and 26 bp to the right of Ad2. This cleavage can remove a large segment of DNA and linearize the DNA once again. A fourth round of right and left adaptors (Ad4) can be ligated to the DNA, the DNA can be amplified (e.g., by PCR), and modified so that the adaptors bind each other and form the completed circular DNA template.

Another example of a sequencing technique that can be used with methods of the present disclosure involves using a chemical-sensitive field effect transistor (chemFET) or ion sensitive field effect transistor (ISFET) array to sequence a nucleic acid (for example, as described in US Patent Application Publication No. 2009/0026082). In one example of the technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3′ end of the sequencing primer can be detected by a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, and the single nucleic acids can be amplified on the bead. The individual beads can then be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.

Another example of a sequencing technique that can be used with methods of the present disclosure involves using an electron microscope (Moudrianakis E. N. and Beer M. Proc Natl Acad Sci USA. 1965 March; 53:564-71). In one example of the technique, individual DNA molecules are labeled using metallic labels that are distinguishable using an electron microscope. These molecules are then stretched on a flat surface and imaged using an electron microscope to determine sequence identities.

Exemplary methods for calling variations in a polynucleotide sequence compared to a reference polynucleotide sequence and for polynucleotide sequence assembly (or reassembly), for example, are provided in U.S. patent publication No. 2011-0004413 (application Ser. No. 12/770,089), which is incorporated herein by reference in its entirety for all purposes. See also Drmanac et al., Science 327, 78-81, 2010. Also incorporated by reference in their entirety and for all purposes are copending related application No. 61/623,876, entitled “Identification Of DNA Fragments And Structural Variations” and Ser. No. 13/447,087, entitled “Processing and Analysis of Complex Nucleic Acid Sequence Data.” Other methods of sequencing or sample processing are described in PCT application No. WO2012142611 A2, hereby incorporated by reference in its entirety.

Assays

In some embodiments, the nucleic acid sample described herein can be subjected to varieties of assays. Assays may include, but are not limited to, sequencing, amplification, hybridization, enrichment, isolation, elution, fragmentation, detection, and quantification of one or more nucleic acid molecules. Assays may include methods for preparing one or more nucleic acid molecules.

In some embodiments, the nucleic acids in the nucleic acid sample described herein can be amplified. Amplification can be performed at any point during a multi reaction procedure using the methods and systems of the invention, e.g., before or after pooling of sequencing libraries from independent reaction volumes and may be used to amplify any suitable target molecule described herein.

Amplification can be performed by any methods or systems known in the art. The nucleic acids may be amplified by polymerase chain reaction (PCR), as described in, for example, U.S. Pat. Nos. 5,928,907 and 6,015,674, hereby incorporated by reference for any purpose. Other methods of nucleic acid amplification may include, for example, ligase chain reaction, oligonucleotide ligation assay, and hybridization assay, as described in greater detail in U.S. Pat. Nos. 5,928,907 and 6,015,674, incorporated by reference in their entirety. Real-time optical detection systems are also known in the art, as also described in greater detail in, for example, U.S. Pat. Nos. 5,928,907 and 6,015,674, incorporated herein above. Other amplification methods that can be used herein include those described in U.S. Pat. Nos. 5,242,794; 5,494,810; 4,988,617; and 6,582,938, all of which are incorporated herein in their entirety. Other amplification techniques that can be used with methods of the present disclosure can include, e.g., AFLP (amplified fragment length polymorphism) PCR (see e.g.: Vos et al. 1995. AFLP: a new technique for DNA fingerprinting. Nucleic Acids Research 23: 4407-14), allele-specific PCR (see e.g., Saiki R K, Bugawan T L, Horn G T, Mullis K B, Erlich H A (1986). Analysis of enzymatically amplified beta-globin and HLA-DQ alpha DNA with allele-specific oligonucleotide probes. Nature 324: 163-166), Alu PCR, assembly PCR (see e.g., Stemmer W P, Crameri A, Ha K D, Brennan T M, Heyneker H L (1995). Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164: 49-53), asymmetric PCR (see e.g., Saiki R K supra), colony PCR, helicase dependent PCR (see e.g., Myriam Vincent, Yan Xu and Huimin Kong (2004). Helicase-dependent isothermal DNA amplification. EMBO reports 5 (8): 795-800), hot start PCR, inverse PCR (see e.g., Ochman H, Gerber A S, Hartl D L. Genetics. 1988 November; 120(3):621-3), in situ PCR, intersequence-specific PCR or ISSR PCR, digital PCR, linear-after-the-exponential-PCR or Late PCR (see e.g., Pierce K E and Wangh L T (2007). Linear-after-the-exponential polymerase chain reaction and allied technologies: Real-time detection strategies for rapid, reliable diagnosis from single cells. Methods Mol. Med. 132: 65-85), long PCR, nested PCR, real-time PCR, duplex PCR, multiplex PCR, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), restriction fragment length polymorphism PCR (PCR-RFLP), PCR-RFLP/RT-PCR-RFLP, polony PCR, in situ rolling circle amplification (RCA), bridge PCR, picotiter PCR, and emulsion PCR, or single cell PCR. Other suitable amplification methods can include transcription amplification, self-sustained sequence replication, selective amplification of target polynucleotide sequences, consensus sequence primed polymerase chain reaction (CP-PCR), arbitrarily primed polymerase chain reaction (AP-PCR), and degenerate oligonucleotide-primed PCR (DOP-PCR). Other methods for achieving amplification of nucleic acids include the ligase chain reaction (LCR), nucleic acid sequence based amplification (NASBA), the Q-beta-replicase method, 3SR (see for example Fahy et al. PCR Methods Appl. 1:25-33 (1991)), or Transcription Mediated Amplification (TMA) used by Gen-Probe. TMA is similar to NASBA in utilizing two enzymes in a self-sustained sequence replication. See U.S. Pat. No. 5,299,491 herein incorporated by reference. Other methods for amplification of nucleic acids can include Strand Displacement Amplification (SDA) (Westin et al 2000, Nature Biotechnology, 18, 199-202; Walker et al 1992, Nucleic Acids Research, 20, 7, 1691-1696), or Rolling Circle Amplification (RCA) (Lizardi et al. 1998, Nature Genetics, 19:225-232).

In some embodiments, amplification methods can be solid-phase amplification, polony amplification, colony amplification, emulsion PCR, bead RCA, surface RCA, surface SDA, etc., as may be recognized by one of ordinary skill in the art. In some embodiments, amplification methods that result in amplification of free DNA molecules in solution or tethered to a suitable matrix by only one end of the DNA molecule can be used. Methods that rely on bridge PCR, where both PCR primers are attached to a surface (see, e.g., WO 2000/018957 and Adessi et al., Nucleic Acids Research (2000): 28(20): E87) can be used. In some cases the methods of the invention can create a “polymerase colony technology,” or “polony,” referring to a multiplex amplification that maintains spatial clustering of identical amplicons (see Harvard Molecular Technology Group and Lipper Center for Computational Genetics website). These include, for example, in situ polonies (Mitra and Church, Nucleic Acid Research 27, e34, Dec. 15, 1999), in situ rolling circle amplification (RCA) (Lizardi et al., Nature Genetics 19, 225, July 1998), bridge PCR (U.S. Pat. No. 5,641,658), picotiter PCR (Leamon et al., Electrophoresis 24, 3769, November 2003), and emulsion PCR (Dressman et al., PNAS 100, 8817, Jul. 22, 2003). The methods of the invention provide new methods for generating and using polonies.

Amplification may be achieved through any process by which the copynumber of a target sequence is increased, e.g., PCR. Conditionsfavorable to the amplification of target sequences by PCR are known inthe art, can be optimized at a variety of steps in the process, anddepend on characteristics of elements in the reaction, such as targettype, target concentration, sequence length to be amplified, sequence ofthe target and/or one or more primers, primer length, primerconcentration, polymerase used, reaction volume, ratio of one or moreelements to one or more other elements, and others, some or all of whichcan be altered. In general, PCR involves the steps of denaturation ofthe target to be amplified (if double stranded), hybridization of one ormore primers to the target, and extension of the primers by a DNApolymerase, with the steps repeated (or “cycled”) in order to amplifythe target sequence. Steps in this process can be optimized for variousoutcomes, such as to enhance yield, decrease the formation of spuriousproducts, and/or increase or decrease specificity of primer annealing.Methods of optimization are well known in the art and includeadjustments to the type or amount of elements in the amplificationreaction and/or to the conditions of a given step in the process, suchas temperature at a particular step, duration of a particular step,and/or number of cycles. In some embodiments, an amplification reactioncomprises at least 5, 10, 15, 20, 25, 30, 35, 50, or more cycles. Insome embodiments, an amplification reaction comprises no more than 5,10, 15, 20, 25, 35, 50, or more cycles. Cycles can contain any number ofsteps, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more steps. Steps cancomprise any temperature or gradient of temperatures, suitable forachieving the purpose of the given step, including but not limited to,3′ end extension (e.g., adaptor fill-in), primer annealing, primerextension, and strand denaturation. 
Steps can be of any duration, including but not limited to about, less than about, or more than about 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 100, 120, 180, 240, 300, 360, 420, 480, 540, 600, or more seconds, including indefinitely until manually interrupted. Cycles of any number comprising different steps can be combined in any order. In some embodiments, different cycles comprising different steps are combined such that the total number of cycles in the combination is about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 50, or more cycles.
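
The relationship between cycle number, step duration, and yield can be illustrated with a short calculation. The following Python sketch is not part of the disclosed methods; it assumes the textbook idealization that each cycle multiplies the copy number by (1 + efficiency), and the function names and the example three-step program are hypothetical.

```python
from typing import List, Tuple


def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized copy number after a number of PCR cycles.

    Assumption: every cycle multiplies the copy number by (1 + efficiency),
    where efficiency lies in [0, 1] and 1.0 means perfect doubling.
    Real reactions plateau, which this sketch ignores.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


def program_duration_seconds(steps: List[Tuple[str, int]], cycles: int) -> int:
    """Total time for a program that repeats the same steps in every cycle.

    `steps` is a list of (step name, duration in seconds) pairs.
    """
    return cycles * sum(seconds for _, seconds in steps)


# Hypothetical three-step cycle: denaturation, primer annealing, primer extension.
example_steps = [("denaturation", 30), ("annealing", 30), ("extension", 60)]
total_seconds = program_duration_seconds(example_steps, cycles=25)
copies = pcr_copies(initial_copies=100, cycles=10)
```

Under these assumptions, lowering the efficiency below 1.0 shows how quickly the expected yield falls away from perfect doubling for a fixed number of cycles.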

The methods disclosed herein may further comprise conducting one or more hybridization reactions on one or more nucleic acid molecules in a sample. The hybridization reactions may comprise the hybridization of one or more capture probes to one or more nucleic acid molecules in a sample or subset of nucleic acid molecules. The hybridization reactions may comprise hybridizing one or more capture probe sets to one or more nucleic acid molecules in a sample or subset of nucleic acid molecules. The hybridization reactions may comprise one or more hybridization arrays, multiplex hybridization reactions, hybridization chain reactions, isothermal hybridization reactions, nucleic acid hybridization reactions, or a combination thereof. The one or more hybridization arrays may comprise hybridization array genotyping, hybridization array proportional sensing, DNA hybridization arrays, macroarrays, microarrays, high-density oligonucleotide arrays, genomic hybridization arrays, comparative hybridization arrays, or a combination thereof. The hybridization reaction may comprise one or more capture probes, one or more beads, one or more labels, one or more subsets of nucleic acid molecules, one or more nucleic acid samples, one or more reagents, one or more wash buffers, one or more elution buffers, one or more hybridization buffers, one or more hybridization chambers, one or more incubators, one or more separators, or a combination thereof.

The methods disclosed herein may further comprise conducting one or more enrichment reactions on one or more nucleic acid molecules in a sample. The enrichment reactions may comprise contacting a sample with one or more beads or bead sets. The enrichment reaction may comprise differential amplification of two or more subsets of nucleic acid molecules based on one or more genomic region features. For example, the enrichment reaction comprises differential amplification of two or more subsets of nucleic acid molecules based on GC content. Alternatively, or additionally, the enrichment reaction comprises differential amplification of two or more subsets of nucleic acid molecules based on methylation state. The enrichment reactions may comprise one or more hybridization reactions. The enrichment reactions may further comprise isolation and/or purification of one or more hybridized nucleic acid molecules, one or more bead bound nucleic acid molecules, one or more free nucleic acid molecules (e.g., capture probe free nucleic acid molecules, bead free nucleic acid molecules), one or more labeled nucleic acid molecules, one or more non-labeled nucleic acid molecules, one or more amplicons, one or more non-amplified nucleic acid molecules, or a combination thereof. Alternatively, or additionally, the enrichment reaction may comprise enriching for one or more cell types in the sample. The one or more cell types may be enriched by flow cytometry.

As shown in FIG. 2, these protocols for work flow may involve enrichment for different genomic or non-genomic regions and comprise one or more different amplification steps to prepare libraries of nucleic acid molecules for assay. Some of these libraries may be combined (2) for assay. Results of some assays may be combined (3) for subsequent analysis. Variant calls or other assessments of sequence or genetic state may be further combined (4) to produce a combined assessment at each locus addressed by the assay.
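
One way to picture the final combination step, in which calls from separate assays yield a single assessment per locus, is sketched below in Python. This is an illustration only: the `Call` record, the `combine_calls` function, and the rule of keeping the call with the greater read depth at each locus are all assumptions made for the example and are not described in the text above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

Locus = Tuple[str, int]  # (chromosome, 1-based position)


@dataclass(frozen=True)
class Call:
    """Hypothetical per-locus variant call produced by one assay."""
    chrom: str
    pos: int
    genotype: str
    depth: int   # number of reads supporting the call
    source: str  # e.g. "untargeted" or "targeted"


def combine_calls(untargeted: Iterable[Call], targeted: Iterable[Call]) -> Dict[Locus, Call]:
    """Return one call per locus.

    Assumed rule: when both assays cover a locus, keep the call with the
    greater read depth; on a tie, keep the call seen first (untargeted).
    """
    combined: Dict[Locus, Call] = {}
    for call in list(untargeted) + list(targeted):
        key = (call.chrom, call.pos)
        best = combined.get(key)
        if best is None or call.depth > best.depth:
            combined[key] = call
    return combined


wgs = [Call("chr1", 100, "0/1", 12, "untargeted"),
       Call("chr2", 50, "0/0", 30, "untargeted")]
panel = [Call("chr1", 100, "1/1", 80, "targeted"),
         Call("chr3", 7, "0/1", 60, "targeted")]
merged = combine_calls(wgs, panel)
```

A depth-based rule is only one possible choice; a quality-weighted or probabilistic combination could be substituted by replacing the comparison inside the loop.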

The one or more enrichment reactions may produce one or more enriched nucleic acid molecules. The enriched nucleic acid molecules may comprise a nucleic acid molecule or variant or derivative thereof. For example, the enriched nucleic acid molecules comprise one or more hybridized nucleic acid molecules, one or more bead bound nucleic acid molecules, one or more free nucleic acid molecules (e.g., capture probe free nucleic acid molecules, bead free nucleic acid molecules), one or more labeled nucleic acid molecules, one or more non-labeled nucleic acid molecules, one or more amplicons, one or more non-amplified nucleic acid molecules, or a combination thereof. The enriched nucleic acid molecules may be differentiated from non-enriched nucleic acid molecules by GC content, molecular size, genomic regions, genomic region features, or a combination thereof. The enriched nucleic acid molecules may be derived from one or more assays, supernatants, eluents, or a combination thereof. The enriched nucleic acid molecules may differ from the non-enriched nucleic acid molecules by mean size, mean GC content, genomic regions, or a combination thereof.
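
The two summary statistics named above, mean size and mean GC content, can be computed directly from the sequences in a subset. The Python sketch below is illustrative; the function names are hypothetical, and it assumes that ambiguous bases such as N are excluded from the GC calculation.

```python
from typing import Dict, Iterable


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence.

    Returns 0.0 when the sequence contains no unambiguous bases.
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


def subset_summary(seqs: Iterable[str]) -> Dict[str, float]:
    """Mean length and mean per-molecule GC fraction for a subset of sequences."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("subset is empty")
    return {
        "mean_size": sum(len(s) for s in seqs) / len(seqs),
        "mean_gc": sum(gc_fraction(s) for s in seqs) / len(seqs),
    }


# Two hypothetical subsets compared by the same statistics.
enriched = subset_summary(["GGCC", "GCGC"])
non_enriched = subset_summary(["ATAT", "ATGC"])
```

Here the mean GC content is the average of per-molecule fractions; weighting by molecule length would be an equally reasonable convention and gives a different number when sizes vary.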

The methods disclosed herein may further comprise conducting one or more isolation or purification reactions on one or more nucleic acid molecules in a sample. The isolation or purification reactions may comprise contacting a sample with one or more beads or bead sets. The isolation or purification reaction may comprise one or more hybridization reactions, enrichment reactions, amplification reactions, sequencing reactions, or a combination thereof. The isolation or purification reaction may comprise the use of one or more separators. The one or more separators may comprise a magnetic separator. The isolation or purification reaction may comprise separating bead bound nucleic acid molecules from bead free nucleic acid molecules. The isolation or purification reaction may comprise separating capture probe hybridized nucleic acid molecules from capture probe free nucleic acid molecules. The isolation or purification reaction may comprise separating a first subset of nucleic acid molecules from a second subset of nucleic acid molecules, wherein the first subset of nucleic acid molecules differs from the second subset of nucleic acid molecules by mean size, mean GC content, genomic regions, or a combination thereof.

The methods disclosed herein may further comprise conducting one or more elution reactions on one or more nucleic acid molecules in a sample. The elution reactions may comprise contacting a sample with one or more beads or bead sets. The elution reaction may comprise separating bead bound nucleic acid molecules from bead free nucleic acid molecules. The elution reaction may comprise separating capture probe hybridized nucleic acid molecules from capture probe free nucleic acid molecules. The elution reaction may comprise separating a first subset of nucleic acid molecules from a second subset of nucleic acid molecules, wherein the first subset of nucleic acid molecules differs from the second subset of nucleic acid molecules by mean size, mean GC content, genomic regions, or a combination thereof.

The methods disclosed herein may further comprise one or more fragmentation reactions. The fragmentation reactions may comprise fragmenting one or more nucleic acid molecules in a sample or subset of nucleic acid molecules to produce one or more fragmented nucleic acid molecules. The one or more nucleic acid molecules may be fragmented by sonication, needle shear, nebulization, shearing (e.g., acoustic shearing, mechanical shearing, point-sink shearing), passage through a French pressure cell, or enzymatic digestion. Enzymatic digestion may occur by nuclease digestion (e.g., micrococcal nuclease digestion, endonucleases, exonucleases, RNase H or DNase I). Fragmentation of the one or more nucleic acid molecules may result in fragment sizes of about 100 basepairs to about 2000 basepairs, about 200 basepairs to about 1500 basepairs, about 200 basepairs to about 1000 basepairs, about 200 basepairs to about 500 basepairs, about 500 basepairs to about 1500 basepairs, or about 500 basepairs to about 1000 basepairs. The one or more fragmentation reactions may result in fragment sizes of about 50 basepairs to about 1000 basepairs. The one or more fragmentation reactions may result in fragment sizes of about 100 basepairs, 150 basepairs, 200 basepairs, 250 basepairs, 300 basepairs, 350 basepairs, 400 basepairs, 450 basepairs, 500 basepairs, 550 basepairs, 600 basepairs, 650 basepairs, 700 basepairs, 750 basepairs, 800 basepairs, 850 basepairs, 900 basepairs, 950 basepairs, 1000 basepairs or more.

Fragmenting the one or more nucleic acid molecules may comprise mechanical shearing of the one or more nucleic acid molecules in the sample for a period of time. The fragmentation reaction may occur for at least about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500 or more seconds.

Fragmenting the one or more nucleic acid molecules may comprise contacting a nucleic acid sample with one or more beads. Fragmenting the one or more nucleic acid molecules may comprise contacting the nucleic acid sample with a plurality of beads, wherein the ratio of the volume of the plurality of beads to the volume of nucleic acid sample is about 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00, 1.10, 1.20, 1.30, 1.40, 1.50, 1.60, 1.70, 1.80, 1.90, 2.00 or more. Fragmenting the one or more nucleic acid molecules may comprise contacting the nucleic acid sample with a plurality of beads, wherein the ratio of the volume of the plurality of beads to the volume of nucleic acid sample is about 2.00, 1.90, 1.80, 1.70, 1.60, 1.50, 1.40, 1.30, 1.20, 1.10, 1.00, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01 or less.

The methods disclosed herein may further comprise conducting one or more detection reactions on one or more nucleic acid molecules in a sample. Detection reactions may comprise one or more sequencing reactions. Alternatively, conducting a detection reaction comprises optical sensing, electrical sensing, or a combination thereof. Optical sensing may comprise optical sensing of a photoluminescence photon emission, fluorescence photon emission, pyrophosphate photon emission, chemiluminescence photon emission, or a combination thereof. Electrical sensing may comprise electrical sensing of an ion concentration, ion current modulation, nucleotide electrical field, nucleotide tunneling current, or a combination thereof.

The methods disclosed herein may further comprise conducting one or more quantification reactions on one or more nucleic acid molecules in a sample. Quantification reactions may comprise sequencing, PCR, qPCR, digital PCR, or a combination thereof.

The methods disclosed herein may further comprise conducting 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 45 or more, or 50 or more assays on a sample comprising one or more nucleic acid molecules. The two or more assays may be different, similar, identical, or a combination thereof. For example, the methods disclosed herein comprise conducting two or more sequencing reactions. In another example, the methods disclosed herein comprise conducting two or more assays, wherein at least one of the two or more assays comprises a sequencing reaction. In yet another example, the methods disclosed herein comprise conducting two or more assays, wherein at least two of the two or more assays comprise a sequencing reaction and a hybridization reaction. The two or more assays may be performed sequentially, simultaneously, or a combination thereof. For example, the two or more sequencing reactions may be performed simultaneously. In another example, the methods disclosed herein comprise conducting a hybridization reaction, followed by a sequencing reaction. In yet another example, the methods disclosed herein comprise conducting two or more hybridization reactions simultaneously, followed by conducting two or more sequencing reactions simultaneously. The two or more assays may be performed by one or more devices. For example, two or more amplification reactions may be performed by a PCR machine. In another example, two or more sequencing reactions may be performed by two or more sequencers.

Devices

The methods and systems disclosed herein may comprise one or more devices. The methods and systems disclosed herein may comprise the use of one or more devices to perform one or more steps or assays comprised therein. The methods and systems disclosed herein may comprise one or more devices and the use thereof in one or more steps or assays. For example, conducting a sequencing reaction may comprise one or more sequencers. In another example, combining a plurality of data inputs and generating a combined data set may comprise the use of one or more computer processors. In yet another example, one or more processors may be used in generating and electronically displaying at least a portion of the data output. Exemplary devices include, but are not limited to, sequencers, computer processors, computer displays, monitors, hard drives, thermocyclers, real-time PCR instruments, magnetic separators, transmission devices, hybridization chambers, electrophoresis apparatus, centrifuges, microscopes, imagers, fluorometers, luminometers, plate readers, computers, processors, and bioanalyzers.

The methods disclosed herein may comprise one or more sequencers. The one or more sequencers may comprise one or more HiSeq, MiSeq, HiScan, Genome Analyzer IIx, SOLiD Sequencer, Ion Torrent PGM, 454 GS Junior, PacBio RS, or a combination thereof. The one or more sequencers may comprise one or more sequencing platforms. The one or more sequencing platforms may comprise GS FLX by 454 Life Sciences/Roche, Genome Analyzer by Solexa/Illumina, SOLiD by Applied Biosystems, CGA Platform by Complete Genomics, PacBio RS by Pacific Biosciences, or a combination thereof.

The methods disclosed herein may comprise one or more thermocyclers. The one or more thermocyclers may be used to amplify one or more nucleic acid molecules. The methods disclosed herein may comprise one or more real-time PCR instruments. The one or more real-time PCR instruments may comprise a thermal cycler and a fluorometer. The one or more thermocyclers may be used to amplify and detect one or more nucleic acid molecules.

The methods disclosed herein may comprise one or more magnetic separators. The one or more magnetic separators may be used for separation of paramagnetic and ferromagnetic particles from a suspension. The one or more magnetic separators may comprise one or more LifeStep™ biomagnetic separators, SPHERO™ FlexiMag separator, SPHERO™ MicroMag separator, SPHERO™ HandiMag separator, SPHERO™ MiniTube Mag separator, SPHERO™ UltraMag separator, DynaMag™ magnet, DynaMag™-2 Magnet, or a combination thereof.

The methods disclosed herein may comprise one or more bioanalyzers. Generally, a bioanalyzer is a chip-based capillary electrophoresis machine that can analyze RNA, DNA, and proteins. The one or more bioanalyzers may comprise Agilent's 2100 Bioanalyzer.

Computer systems

FIG. 1 shows a computer system (also “system” herein) 101 programmed or otherwise configured to implement the methods of the disclosure, such as receiving and/or combining sequencing data and/or annotating sequencing data. The system 101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The system 101 also includes memory 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communications interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and/or electronic display adapters. The memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communications bus (solid lines), such as a motherboard. The storage unit 115 can be a data storage unit (or data repository) for storing data. The system 101 is operatively coupled to a computer network (“network”) 130 with the aid of the communications interface 120. The network 130 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 130 in some cases is a telecommunication and/or data network. The network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 130 in some cases, with the aid of the system 101, can implement a peer-to-peer network, which may enable devices coupled to the system 101 to behave as a client or a server.

The system 101 is in communication with a processing system 135. The processing system 135 can be configured to implement the methods disclosed herein, such as sequencing a nucleic acid sample or portion thereof. In some examples, the processing system 135 is a nucleic acid sequencing system, such as, for example, a next generation sequencing system (e.g., Illumina sequencer, Ion Torrent sequencer, Pacific Biosciences sequencer, Oxford Nanopore Technologies sequencer). The processing system 135 can be in communication with the system 101 through the network 130, or by direct (e.g., wired, wireless) connection. The processing system 135 can be configured for analysis, such as nucleic acid sequence analysis.

Methods and systems as described herein can be implemented by way of machine (or computer processor) executable code (or software) stored on an electronic storage location of the system 101, such as, for example, on the memory 110 or electronic storage unit 115. During use, the code can be executed by the processor 105. In some examples, the code can be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some situations, the electronic storage unit 115 can be precluded, and machine-executable instructions are stored on memory 110.

The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

Aspects of the systems and methods provided herein can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 101 can include or be in communication with an electronic display that comprises a user interface (UI) for providing, for example, operational parameters of a charging station and/or electric vehicle. Examples of UIs include, without limitation, a graphical user interface (GUI) and a web-based user interface.

In some embodiments, the system 101 includes a display to provide visual information to a user. In some embodiments, the display is a cathode ray tube (CRT). In some embodiments, the display is a liquid crystal display (LCD). In further embodiments, the display is a thin film transistor liquid crystal display (TFT-LCD). In some embodiments, the display is an organic light emitting diode (OLED) display. In various further embodiments, an OLED display is a passive-matrix OLED (PMOLED) or active-matrix OLED (AMOLED) display. In some embodiments, the display is a plasma display. In other embodiments, the display is a video projector. In still further embodiments, the display is a combination of devices such as those disclosed herein.

In some embodiments, the system 101 includes an input device to receive information from a user. In some embodiments, the input device is a keyboard. In some embodiments, the input device is a pointing device including, by way of non-limiting examples, a mouse, trackball, trackpad, joystick, game controller, or stylus. In some embodiments, the input device is a touch screen or a multi-touch screen. In other embodiments, the input device is a microphone to capture voice or other sound input. In other embodiments, the input device is a video camera to capture motion or visual input. In still further embodiments, the input device is a combination of devices such as those disclosed herein.

The system 101 can include or be operably coupled to one or more databases. The databases may comprise genomic, proteomic, pharmacogenomic, biomedical, and scientific databases. The databases may be publicly available databases. Alternatively, or additionally, the databases may comprise proprietary databases. The databases may be commercially available databases. The databases include, but are not limited to, MendelDB, PharmGKB, Varimed, Regulome, curated BreakSeq junctions, Online Mendelian Inheritance in Man (OMIM), Human Genome Mutation Database (HGMD), NCBI dbSNP, NCBI RefSeq, GENCODE, GO (gene ontology), and Kyoto Encyclopedia of Genes and Genomes (KEGG).

The methods disclosed herein may comprise analyzing one or more databases. The methods disclosed herein may comprise analyzing at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30 or more databases. Analyzing the one or more databases may comprise one or more algorithms, computers, processors, memory locations, devices, or a combination thereof.

The methods disclosed herein may comprise producing one or more probes based on data and/or information from one or more databases. The methods disclosed herein may comprise producing one or more probe sets based on data and/or information from one or more databases. The methods disclosed herein may comprise producing one or more probes and/or probe sets based on data and/or information from at least about 2 or more databases. The methods disclosed herein may comprise producing one or more probes and/or probe sets based on data and/or information from at least about 3 or more databases. The methods disclosed herein may comprise producing one or more probes and/or probe sets based on data and/or information from at least about 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30 or more databases.

The methods disclosed herein may comprise identifying one or more nucleic acid regions based on data and/or information from one or more databases. The methods disclosed herein may comprise identifying one or more sets of nucleic acid regions based on data and/or information from one or more databases. The methods disclosed herein may comprise identifying one or more nucleic acid regions and/or sets of nucleic acid regions based on data and/or information from at least about 2 or more databases. The methods disclosed herein may comprise identifying one or more nucleic acid regions and/or sets of nucleic acid regions based on data and/or information from at least about 3 or more databases. The methods disclosed herein may comprise identifying one or more nucleic acid regions and/or sets of nucleic acid regions based on data and/or information from at least about 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30 or more databases. The methods disclosed herein may further comprise producing one or more probes and/or probe sets based on the identification of the one or more nucleic acid regions and/or sets of nucleic acid regions.
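
As a rough illustration of assembling a set of nucleic acid regions from several databases, the Python sketch below merges overlapping intervals reported by different sources into a single non-redundant list, which could then serve as a starting point for probe design. The half-open coordinate convention, the `merge_regions` function, and the source names `db_a` and `db_b` are assumptions for the example and do not correspond to any specific database listed above.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open interval


def merge_regions(sources: Dict[str, List[Region]]) -> List[Region]:
    """Union the regions from all sources and merge those that overlap or touch.

    Regions are sorted by chromosome name and start coordinate; a region is
    folded into the previous one when it lies on the same chromosome and
    starts at or before the previous region's end.
    """
    all_regions = sorted(r for regions in sources.values() for r in regions)
    merged: List[Region] = []
    for chrom, start, end in all_regions:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical region lists from two sources.
sources = {
    "db_a": [("chr1", 100, 200), ("chr2", 5, 10)],
    "db_b": [("chr1", 150, 300), ("chr1", 400, 500)],
}
target_regions = merge_regions(sources)
```

Taking the union is only one policy; requiring support from at least two or three sources would instead call for counting how many sources cover each position before merging.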

The methods disclosed herein may comprise analyzing one or more results based on data and/or information from one or more databases. The methods disclosed herein may comprise analyzing one or more sets of results based on data and/or information from one or more databases. The methods disclosed herein may comprise analyzing one or more combined results based on data and/or information from one or more databases. The methods disclosed herein may comprise analyzing one or more results, sets of results, and/or combined results based on data and/or information from at least about 2 or more databases. The methods disclosed herein may comprise analyzing one or more results, sets of results, and/or combined results based on data and/or information from at least about 3 or more databases. The methods disclosed herein may comprise analyzing one or more results, sets of results, and/or combined results based on data and/or information from at least about 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30 or more databases.

The methods disclosed herein may comprise comparing one or more results based on data and/or information from one or more databases. The methods disclosed herein may comprise comparing one or more sets of results based on data and/or information from one or more databases. The methods disclosed herein may comprise comparing one or more combined results based on data and/or information from one or more databases. The methods disclosed herein may comprise comparing one or more results, sets of results, and/or combined results based on data and/or information from at least about 2 or more databases. The methods disclosed herein may comprise comparing one or more results, sets of results, and/or combined results based on data and/or information from at least about 3 or more databases. The methods disclosed herein may comprise comparing one or more results, sets of results, and/or combined results based on data and/or information from at least about 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 30 or more databases.

The methods disclosed herein may comprise biomedical databases, genomic databases, biomedical reports, disease reports, case-control analysis, and rare variant discovery analysis based on data and/or information from one or more databases, one or more assays, one or more data or results, one or more outputs based on or derived from one or more assays, one or more outputs based on or derived from one or more data or results, or a combination thereof.

Analysis

The systems and methods as disclosed herein may comprise, or comprise the use of, one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The data and/or results may be based on or derived from one or more assays, one or more databases, or a combination thereof. The methods and systems as disclosed herein may comprise, or comprise the use of, analysis of the one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The methods and systems as disclosed herein may comprise, or comprise the use of, processing of the one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof.

The systems and methods as disclosed herein may comprise, or comprise the use of, at least one analysis and at least one processing step of the one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The methods and systems as disclosed herein may comprise, or comprise the use of, one or more analyses and one or more processing steps of the one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The methods and systems as disclosed herein may comprise, or comprise the use of, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more distinct analyses of the one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The methods and systems as disclosed herein may comprise, or comprise the use of, at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more distinct processing steps of the one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The one or more analyses and/or one or more processing steps may occur simultaneously, sequentially, or a combination thereof.

The one or more analyses and/or one or more processing steps may occur over 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more time points. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more hour period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more day period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more week period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more month period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more year period.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more data. The one or more data may comprise one or more raw data based on or derived from one or more assays. The one or more data may comprise one or more raw data based on or derived from one or more databases. The one or more data may comprise at least partially analyzed data based on or derived from one or more raw data. The one or more data may comprise at least partially processed data based on or derived from one or more raw data. The one or more data may comprise fully analyzed data based on or derived from one or more raw data. The one or more data may comprise fully processed data based on or derived from one or more raw data. The data may comprise sequencing read data or expression data. The data may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more combined data. The one or more combined data may comprise two or more data. The one or more combined data may comprise two or more data sets. The one or more combined data may comprise one or more raw data based on or derived from one or more assays. The one or more combined data may comprise one or more raw data based on or derived from one or more databases. The one or more combined data may comprise at least partially analyzed data based on or derived from one or more raw data. The one or more combined data may comprise at least partially processed data based on or derived from one or more raw data. The one or more combined data may comprise fully analyzed data based on or derived from one or more raw data. The one or more combined data may comprise fully processed data based on or derived from one or more raw data. One or more combined data may comprise sequencing read data or expression data. One or more combined data may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more data sets. The one or more data sets may comprise one or more data. The one or more data sets may comprise one or more combined data. The one or more data sets may comprise one or more raw data based on or derived from one or more assays. The one or more data sets may comprise one or more raw data based on or derived from one or more databases. The one or more data sets may comprise at least partially analyzed data based on or derived from one or more raw data. The one or more data sets may comprise at least partially processed data based on or derived from one or more raw data. The one or more data sets may comprise fully analyzed data based on or derived from one or more raw data. The one or more data sets may comprise fully processed data based on or derived from one or more raw data. The data sets may comprise sequencing read data or expression data. The data sets may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more combined data sets. The one or more combined data sets may comprise two or more data. The one or more combined data sets may comprise two or more combined data. The one or more combined data sets may comprise two or more data sets. The one or more combined data sets may comprise one or more raw data based on or derived from one or more assays. The one or more combined data sets may comprise one or more raw data based on or derived from one or more databases. The one or more combined data sets may comprise at least partially analyzed data based on or derived from one or more raw data. The one or more combined data sets may comprise at least partially processed data based on or derived from one or more raw data. The one or more combined data sets may comprise fully analyzed data based on or derived from one or more raw data. The one or more combined data sets may comprise fully processed data based on or derived from one or more raw data. The methods and systems as disclosed herein may further comprise additional processing and/or analysis of the combined data sets. One or more combined data sets may comprise sequencing read data or expression data. One or more combined data sets may comprise biomedical, scientific, pharmacological, and/or genetic information.
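By way of illustration only, one way two data sets might be combined into a combined data set is sketched below. The text does not prescribe a combination rule; the key structure (chromosome, position, reference allele, alternate allele), the use of supporting read counts, and the function name `combine_allele_counts` are all assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical key for a candidate variant: (chromosome, position, ref, alt).
Site = Tuple[str, int, str, str]


def combine_allele_counts(
    untargeted: Dict[Site, int], targeted: Dict[Site, int]
) -> Dict[Site, int]:
    """Pool supporting read counts per site from two data sets.

    Assumption: both inputs map a site to the number of reads supporting
    the alternate allele, so pooling is a per-site sum.
    """
    combined = Counter(untargeted)
    combined.update(targeted)  # Counter.update adds counts key by key
    return dict(combined)


# Illustrative values only, not measured data.
untargeted_counts = {("chr1", 100, "A", "G"): 3}
targeted_counts = {("chr1", 100, "A", "G"): 40, ("chr2", 5, "C", "T"): 12}
combined_counts = combine_allele_counts(untargeted_counts, targeted_counts)
```

Sites present in only one input are carried through unchanged, so the combined data set covers the union of both inputs.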

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more results. The one or more results may comprise one or more data, data sets, combined data, and/or combined data sets. The one or more results may be based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more results may be produced from one or more assays. The one or more results may be based on or derived from one or more assays. The one or more results may be based on or derived from one or more databases. The one or more results may comprise at least partially analyzed results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more results may comprise at least partially processed results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more results may comprise fully analyzed results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more results may comprise fully processed results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The results may comprise sequencing read data or expression data. The results may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more sets of results. The one or more sets of results may comprise one or more data, data sets, combined data, and/or combined data sets. The one or more sets of results may be based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more sets of results may be produced from one or more assays. The one or more sets of results may be based on or derived from one or more assays. The one or more sets of results may be based on or derived from one or more databases. The one or more sets of results may comprise at least partially analyzed sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more sets of results may comprise at least partially processed sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more sets of results may comprise fully analyzed sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more sets of results may comprise fully processed sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The sets of results may comprise sequencing read data or expression data. The sets of results may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more combined results. The combined results may comprise one or more results, sets of results, and/or combined sets of results. The combined results may be based on or derived from one or more results, sets of results, and/or combined sets of results. The one or more combined results may comprise one or more data, data sets, combined data, and/or combined data sets. The one or more combined results may be based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined results may be produced from one or more assays. The one or more combined results may be based on or derived from one or more assays. The one or more combined results may be based on or derived from one or more databases. The one or more combined results may comprise at least partially analyzed combined results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined results may comprise at least partially processed combined results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined results may comprise fully analyzed combined results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined results may comprise fully processed combined results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The combined results may comprise sequencing read data or expression data. The combined results may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more combined sets of results. The combined sets of results may comprise one or more results, sets of results, and/or combined results. The combined sets of results may be based on or derived from one or more results, sets of results, and/or combined results. The one or more combined sets of results may comprise one or more data, data sets, combined data, and/or combined data sets. The one or more combined sets of results may be based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined sets of results may be produced from one or more assays. The one or more combined sets of results may be based on or derived from one or more assays. The one or more combined sets of results may be based on or derived from one or more databases. The one or more combined sets of results may comprise at least partially analyzed combined sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined sets of results may comprise at least partially processed combined sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined sets of results may comprise fully analyzed combined sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The one or more combined sets of results may comprise fully processed combined sets of results based on or derived from one or more data, data sets, combined data, and/or combined data sets. The combined sets of results may comprise sequencing read data or expression data. The combined sets of results may comprise biomedical, scientific, pharmacological, and/or genetic information.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more outputs, sets of outputs, combined outputs, and/or combined sets of outputs. The methods, libraries, kits, and systems herein may comprise producing one or more outputs, sets of outputs, combined outputs, and/or combined sets of outputs. The sets of outputs may comprise one or more outputs, one or more combined outputs, or a combination thereof. The combined outputs may comprise one or more outputs, one or more sets of outputs, one or more combined sets of outputs, or a combination thereof. The combined sets of outputs may comprise one or more outputs, one or more sets of outputs, one or more combined outputs, or a combination thereof. The one or more outputs, sets of outputs, combined outputs, and/or combined sets of outputs may be based on or derived from one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, or a combination thereof. The one or more outputs, sets of outputs, combined outputs, and/or combined sets of outputs may be based on or derived from one or more databases. The one or more outputs, sets of outputs, combined outputs, and/or combined sets of outputs may comprise one or more biomedical reports, biomedical outputs, rare variant outputs, pharmacogenetic outputs, population study outputs, case-control outputs, biomedical databases, genomic databases, disease databases, and/or net content.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more biomedical outputs, one or more sets of biomedical outputs, one or more combined biomedical outputs, and/or one or more combined sets of biomedical outputs. The methods, libraries, kits and systems herein may comprise producing one or more biomedical outputs, one or more sets of biomedical outputs, one or more combined biomedical outputs, and/or one or more combined sets of biomedical outputs. The sets of biomedical outputs may comprise one or more biomedical outputs, one or more combined biomedical outputs, or a combination thereof. The combined biomedical outputs may comprise one or more biomedical outputs, one or more sets of biomedical outputs, one or more combined sets of biomedical outputs, or a combination thereof. The combined sets of biomedical outputs may comprise one or more biomedical outputs, one or more sets of biomedical outputs, one or more combined biomedical outputs, or a combination thereof. The one or more biomedical outputs, one or more sets of biomedical outputs, one or more combined biomedical outputs, and/or one or more combined sets of biomedical outputs may be based on or derived from one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, one or more outputs, one or more sets of outputs, one or more combined outputs, one or more sets of combined outputs, or a combination thereof. The one or more biomedical outputs may comprise biomedical information of a subject. The biomedical information of the subject may predict, diagnose, and/or prognose one or more biomedical features. The one or more biomedical features may comprise the status of a disease or condition, genetic risk of a disease or condition, reproductive risk, genetic risk to a fetus, risk of an adverse drug reaction, efficacy of a drug therapy, prediction of optimal drug dosage, transplant tolerance, or a combination thereof.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more biomedical reports. The methods, libraries, kits, and systems herein may comprise producing one or more biomedical reports. The one or more biomedical reports may be based on or derived from one or more data, one or more data sets, one or more combined data, one or more combined data sets, one or more results, one or more sets of results, one or more combined results, one or more outputs, one or more sets of outputs, one or more combined outputs, one or more sets of combined outputs, one or more biomedical outputs, one or more sets of biomedical outputs, one or more combined biomedical outputs, one or more combined sets of biomedical outputs, or a combination thereof. The biomedical report may predict, diagnose, and/or prognose one or more biomedical features. The one or more biomedical features may comprise the status of a disease or condition, genetic risk of a disease or condition, reproductive risk, genetic risk to a fetus, risk of an adverse drug reaction, efficacy of a drug therapy, prediction of optimal drug dosage, transplant tolerance, or a combination thereof.

The methods and systems as disclosed herein may also comprise, or comprise the use of, the transmission of one or more data, information, results, outputs, reports or a combination thereof. For example, data/information based on or derived from the one or more assays is transmitted to another device and/or instrument. In another example, the data, results, outputs, biomedical outputs, biomedical reports, or a combination thereof are transmitted to another device and/or instrument. The information obtained from an algorithm may also be transmitted to another device and/or instrument. Information based on the analysis of one or more databases may be transmitted to another device and/or instrument. Transmission of the data/information may comprise the transfer of data/information from a first source to a second source. The first and second sources may be in the same approximate location (e.g., within the same room, building, block, campus). Alternatively, the first and second sources may be in multiple locations (e.g., multiple cities, states, countries, continents, etc.). The data, results, outputs, biomedical outputs, and/or biomedical reports can be transmitted to a patient and/or a healthcare provider.

Transmission may be based on the analysis of one or more data, results, information, databases, outputs, reports, or a combination thereof. For example, transmission of a second report is based on the analysis of a first report. Alternatively, transmission of a report is based on the analysis of one or more data or results. Transmission may be based on receiving one or more requests. For example, transmission of a report may be based on receiving a request from a user (e.g., patient, healthcare provider, individual).

Transmission of the data/information may comprise digital transmission or analog transmission. Digital transmission may comprise the physical transfer of data (a digital bit stream) over a point-to-point or point-to-multipoint communication channel. Examples of such channels are copper wires, optical fibers, wireless communication channels, and storage media. The data may be represented as an electromagnetic signal, such as an electrical voltage, radiowave, microwave, or infrared signal.

Analog transmission may comprise the transfer of a continuously varying analog signal. The messages can either be represented by a sequence of pulses by means of a line code (baseband transmission), or by a limited set of continuously varying wave forms (passband transmission), using a digital modulation method. The passband modulation and corresponding demodulation (also known as detection) can be carried out by modem equipment. According to the most common definition of digital signal, both baseband and passband signals representing bit-streams are considered as digital transmission, while an alternative definition only considers the baseband signal as digital, and passband transmission of digital data as a form of digital-to-analog conversion.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more sample identifiers. The sample identifiers may comprise labels, barcodes, and other indicators which can be linked to one or more samples and/or subsets of nucleic acid molecules. The methods disclosed herein may comprise one or more processors, one or more memory locations, one or more computers, one or more monitors, one or more computer software programs, and/or one or more algorithms for linking data, results, outputs, biomedical outputs, and/or biomedical reports to a sample.
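As a minimal sketch of linking outputs to a sample through a sample identifier, the following uses an in-memory lookup. The class name `SampleLinker`, its methods, and the example barcode and sample names are hypothetical and are not taken from the text, which does not specify a data structure.

```python
from collections import defaultdict
from typing import Dict, List


class SampleLinker:
    """Hypothetical registry linking identifiers (e.g., barcodes) to samples."""

    def __init__(self) -> None:
        self._sample_by_identifier: Dict[str, str] = {}
        self._outputs_by_sample: Dict[str, List[str]] = defaultdict(list)

    def register(self, identifier: str, sample_name: str) -> None:
        # Each identifier may point at only one sample.
        if identifier in self._sample_by_identifier:
            raise ValueError(f"identifier already registered: {identifier}")
        self._sample_by_identifier[identifier] = sample_name

    def link(self, identifier: str, output: str) -> None:
        # Raises KeyError if the identifier was never registered.
        sample_name = self._sample_by_identifier[identifier]
        self._outputs_by_sample[sample_name].append(output)

    def outputs_for(self, sample_name: str) -> List[str]:
        # Return a copy so callers cannot mutate the registry.
        return list(self._outputs_by_sample.get(sample_name, []))


linker = SampleLinker()
linker.register("BARCODE-0001", "sample-A")
linker.link("BARCODE-0001", "variant call result")
linker.link("BARCODE-0001", "biomedical report")
```

Several identifiers can be registered against the same sample name, which mirrors the case where one sample is partitioned into subsets of nucleic acid molecules.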

The methods and systems as disclosed herein may comprise, or comprise the use of, a processor for correlating the expression levels of one or more nucleic acid molecules with a prognosis of disease outcome. The methods disclosed herein may comprise one or more of a variety of correlative techniques, including lookup tables, algorithms, multivariate models, and linear or nonlinear combinations of expression models or algorithms. The expression levels may be converted to one or more likelihood scores, reflecting a likelihood that the patient providing the sample may exhibit a particular disease outcome. The models and/or algorithms can be provided in machine readable format and can optionally further designate a treatment modality for a patient or class of patients.
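One possible way to convert expression levels to a likelihood score is a linear combination passed through a logistic function, sketched below. This is only one of the correlative techniques listed above; the function name `likelihood_score`, the gene names, the weights, and the intercept are hypothetical values chosen for illustration, not fitted parameters.

```python
import math
from typing import Dict


def likelihood_score(
    expression: Dict[str, float],
    weights: Dict[str, float],
    intercept: float = 0.0,
) -> float:
    """Map expression levels to a score in (0, 1) with a logistic model.

    Assumption: every gene named in `weights` is present in `expression`;
    a missing gene raises KeyError rather than being silently ignored.
    """
    linear = intercept + sum(w * expression[gene] for gene, w in weights.items())
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical model parameters and measurements.
example_weights = {"GENE_A": 0.8, "GENE_B": -0.5}
low = likelihood_score({"GENE_A": 1.0, "GENE_B": 2.0}, example_weights)
high = likelihood_score({"GENE_A": 3.0, "GENE_B": 2.0}, example_weights)
```

With a positive weight on `GENE_A`, raising its expression level raises the score, so `high` exceeds `low`.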

In some cases, at least a portion of the results or the outputs are entered into a database for access by representatives or agents of a sequencing business, the individual, a medical provider, or insurance provider. In some cases, outputs include polymorphism classification, identification, or diagnosis by a representative, agent or consultant of the business, such as a medical professional. In other cases, a computer or algorithmic analysis of the data is provided automatically. In some cases, the sequencing business may bill the individual, insurance provider, medical provider, researcher, or government entity for one or more of the following: sequencing assays performed, consulting services, data analysis, reporting of results, or database access.

In some embodiments of the present invention, at least a portion of the output, or the combined data, is presented or displayed as a report on a computer screen or as a paper record. In some embodiments, the output is an electronic report. In some cases, the output is displayed in numeric and/or graphical form. For example without limitation, at least a portion of the output of the combined data is displayed on a graphical user interface of an electronic display coupled to the computer processor. In some cases, the output, display or report may include, but is not limited to, such information as one or more of the following: the number of copy number variations identified, the suitability of the original sample, the number of genes showing different polymorphisms, one or more haplotypes, a diagnosis, a statistical confidence for the diagnosis, the likelihood of a specific condition or disorder, and indicated therapies.

Performance

The methods and systems disclosed herein can detect one or more genomic regions (i.e., copy number variation, or one or more polymorphisms) with a specificity or sensitivity of about or greater than about 50%, 55%, 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of about or at least about 80%, 85%, 90%, 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more. The methods and systems disclosed herein can detect one or more genomic regions (i.e., copy number variation, or one or more polymorphisms) with a specificity or sensitivity of about or greater than about 50%. The methods and systems disclosed herein can diagnose a specific condition based on the detected genomic regions such as copy number variation. The methods and systems can diagnose a specific condition with a specificity or sensitivity of greater than 50%, 55%, 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5%, or a positive predictive value or negative predictive value of at least 80%, 85%, 90%, 95%, 95.5%, 96%, 96.5%, 97%, 97.5%, 98%, 98.5%, 99%, 99.5% or more.
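For reference, the performance measures named above follow the standard definitions in terms of true positive, false positive, true negative, and false negative counts. The sketch below computes them; the class and function names and the example counts are hypothetical and do not represent results of the disclosed methods.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from comparing calls against a reference (truth) set."""

    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def performance_metrics(c: ConfusionCounts) -> Dict[str, float]:
    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios are reported as NaN instead of raising.
        return numerator / denominator if denominator else float("nan")

    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),
        "specificity": ratio(c.tn, c.tn + c.fp),
        "ppv": ratio(c.tp, c.tp + c.fp),
        "npv": ratio(c.tn, c.tn + c.fn),
        "accuracy": ratio(c.tp + c.tn, total),
        "error_rate": ratio(c.fp + c.fn, total),
    }


# Illustrative counts only.
metrics = performance_metrics(ConfusionCounts(tp=90, fp=50, tn=950, fn=10))
```

Sensitivity and specificity describe the method itself, whereas the positive and negative predictive values also depend on how common the variant or condition is in the tested population.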

The methods and systems disclosed herein may increase the sensitivity or specificity when compared to the sensitivity or specificity of current sequencing methods. For example without limitation, in some embodiments, combined whole exome sequencing and whole genome sequencing reactions may increase the sensitivity or specificity in detecting one or more copy number variations or diagnosing a specific condition when compared to the sensitivity or specificity of whole exome sequencing alone. The sensitivity or specificity of the methods and systems as described herein may increase by at least about 1%, 2%, 3%, 4%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, 10%, 10.5%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 70%, 80%, 90%, 95%, 97% or more. The sensitivity or specificity of the methods and systems as described herein may increase by at least about 4.5%-20%, about 5%-15%, about 7%-12%, or about 8%-10%. In some embodiments, the methods and systems disclosed herein may have a similar sensitivity or specificity when compared to the sensitivity or specificity of high coverage whole genome sequencing alone.

In some embodiments, the methods and systems as described herein comprise combining untargeted sequencing data (e.g., low coverage whole genome sequencing data) and one or more target-specific sequencing data. The methods and systems disclosed herein may have a sensitivity, specificity, positive predictive value or negative predictive value that is similar to that of high coverage whole genome sequencing data alone. The sensitivity, specificity, positive predictive value or negative predictive value may be for the detection of one or more haplotypes, SNV, CNV or one or more polymorphisms. In some embodiments, the methods and systems as disclosed herein comprise untargeted sequencing data (e.g., low coverage whole genome sequencing data) that may have a sensitivity, specificity, positive predictive value or negative predictive value that is less than 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% for one or more SNV. In some embodiments, the methods and systems as disclosed herein may have a sensitivity, specificity, positive predictive value or negative predictive value that is less than 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% for one or more polymorphisms, specific genes or genomic regions. In some embodiments, the untargeted sequencing (e.g., whole genome sequencing) in the methods and systems as disclosed herein may have a sensitivity, specificity, positive predictive value or negative predictive value that is less than 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% for one or more SNV, one or more polymorphisms or one or more specific genes or genomic regions. In some embodiments, the target-specific sequencing data may have a sensitivity, specificity, positive predictive value or negative predictive value that is about, at least about or less than about 50%, 55%, 60%, 65%, 70%, 75%, 76%, 77%, 78%, 79%, 80%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.5% or 100%. In some cases, the untargeted sequencing can have a sensitivity, specificity, positive predictive value or negative predictive value that is between about 50% and 80%.

The methods and systems disclosed herein can detect one or more genomic regions (i.e., copy number variation, or one or more polymorphisms) with an error rate of less than 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, 10% or less. The methods and systems disclosed herein can diagnose a specific condition based on the detected genomic regions such as copy number variation. The methods and systems can diagnose a specific condition with an error rate of less than 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, 10% or less.

The percent error of the methods and systems as described herein may be similar to that of current sequencing methods. For example without limitation, in some embodiments, combined whole exome sequencing and whole genome sequencing reactions may have a similar percent error rate in detecting one or more copy number variations or diagnosing a specific condition when compared to the percent error rate of whole exome sequencing alone. The current sequencing methods may be high coverage whole genome sequencing alone. The percent error rate of the methods and systems as described herein may be within about 0.001%, 0.002%, 0.003%, 0.004%, 0.005%, 0.006%, 0.007%, 0.008%, 0.009%, 0.01%, 0.02%, 0.03%, 0.04%, 0.05%, 0.06%, 0.07%, 0.08%, 0.09%, 1%, 1.1%, 1.2%, 1.3%, 1.4%, 1.5%, 1.6%, 1.7%, 1.8%, 1.9%, or 2% of the current sequencing methods. The percent error rate of the methods and systems as described herein may be less than the percent error rate of current sequencing methods. The percent error rate of the methods and systems as described herein may be at least about 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1.75%, 1.5%, 1.25%, 1%, 0.9%, 0.8%, 0.7%, 0.6%, 0.5%, 0.4%, 0.3%, 0.2%, or 0.1% less than the percent error rate of current sequencing methods. The percent error rate of the methods and systems as described herein may be less than about 2%, 1.75%, 1.5%, 1.25%, 1%, 0.75%, 0.50%, 0.25%, 0.10%, 0.075%, 0.050%, 0.025%, or 0.001%. In some embodiments, the methods and systems disclosed herein may have a similar percent error rate when compared to the percent error rate of high coverage whole genome sequencing alone.

The error of the methods and systems as described herein can be determined as a Phred quality score. The Phred quality score may be assigned to each base call in automated sequencer traces and may be used to compare the efficacy of different sequencing methods. The Phred quality score (Q) may be defined as a property which is logarithmically related to the base-calling error probabilities (P). The Phred quality score (Q) may be calculated as $Q = -10 \log_{10} P$. The Phred quality score of the methods and systems as described herein may be similar to the Phred quality score of current sequencing methods. For example without limitation, in some embodiments, combined whole exome sequencing and low coverage whole genome sequencing reactions may have a similar Phred quality score in detecting one or more copy number variations or diagnosing a specific condition when compared to the Phred quality score of whole exome sequencing alone or high coverage whole genome sequencing alone. The Phred quality score of the methods and systems as described herein may be within 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 of the Phred quality score of current sequencing methods. The Phred quality score of the methods and systems as described herein may be less than the Phred quality score of current sequencing methods. The Phred quality score of the methods and systems as described herein may be at least about 10, 9, 8, 7, 6, 5, 4, 3, 2, or 1 less than the Phred quality score of current sequencing methods. The Phred quality score of the methods and systems as described herein may be greater than 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, or 30. The Phred quality score of the methods and systems as described herein may be greater than 35, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, or 60. The Phred quality score of the methods and systems as described herein may be at least 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60 or more.
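To make the logarithmic relationship between the Phred quality score (Q) and the base-calling error probability (P) concrete, the conversion in both directions can be sketched as follows. The function names are chosen for this example only.

```python
import math


def phred_from_error_probability(p: float) -> float:
    """Q = -10 * log10(P), defined for error probabilities in (0, 1]."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_probability_from_phred(q: float) -> float:
    """Inverse relationship: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


q_for_one_in_thousand = phred_from_error_probability(0.001)
p_for_q20 = error_probability_from_phred(20.0)
```

Under this definition, an error probability of 1 in 1,000 corresponds to a score of about 30 and a score of 20 corresponds to an error probability of about 1 in 100, so each 10-point increase in Q is a tenfold drop in the error probability.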

The accuracy of the one or more sequencing reactions may be similar to that of current sequencing methods in detecting and identifying one or more specific genomic regions. The current sequencing methods can be whole exome sequencing alone or high coverage whole genome sequencing alone. The accuracy of the methods and systems as described herein may be within about 0.001%, 0.002%, 0.003%, 0.004%, 0.005%, 0.006%, 0.007%, 0.008%, 0.009%, 0.01%, 0.02%, 0.03%, 0.04%, 0.05%, 0.06%, 0.07%, 0.08%, 0.09%, 1%, 1.1%, 1.2%, 1.3%, 1.4%, 1.5%, 1.6%, 1.7%, 1.8%, 1.9%, 2%, 2.25%, 2.5%, 2.75%, 3%, 3.25%, 3.5%, 3.75%, or 4% of the current sequencing methods. The accuracy of the methods and systems as described herein may be greater than the accuracy of current sequencing methods. The accuracy of the methods and systems as described herein may be at least about 0.001%, 0.002%, 0.003%, 0.004%, 0.005%, 0.006%, 0.007%, 0.008%, 0.009%, 0.01%, 0.02%, 0.03%, 0.04%, 0.05%, 0.06%, 0.07%, 0.08%, 0.09%, 1%, 1.1%, 1.2%, 1.3%, 1.4%, 1.5%, 1.6%, 1.7%, 1.8%, 1.9%, 2%, 2.25%, 2.5%, 2.75%, 3%, 3.25%, 3.5%, 3.75%, 4%, 4.5%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 15%, 17%, 20%, 25%, 30%, 35%, 40%, 50%, or 60% greater than the accuracy of current sequencing methods. The accuracy of the methods and systems as described herein may be greater than about 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98.25%, 98.5%, 98.75%, 99%, 99.25%, 99.5%, or 99.75%. The accuracy of the methods and systems as described herein may be greater than about 99.1%, 99.2%, 99.3%, 99.4%, 99.5%, 99.6%, 99.7%, 99.8%, 99.9%, 99.99%, or 99.999%.

The methods and systems disclosed herein can generate output data identifying one or more specific genomic regions (i.e., copy number variation, or one or more polymorphisms) in a shorter time than high coverage whole genome sequencing alone. In some embodiments, the methods and systems as described herein can identify specific genomic regions in less than 1 month, 3.5 weeks, 3 weeks, 2.5 weeks, 2 weeks, 1.5 weeks or 1 week. In some embodiments, the methods and systems as described herein can identify specific genomic regions in less than 6, 5, 4, 3, 2 or 1 days. In some embodiments, the methods and systems as described herein can identify specific genomic regions in less than 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 hours. In some embodiments, the methods and systems as described herein can identify specific genomic regions in less than 60, 59, 55, 50, 45, 40, 35, 30, 25, 20, 15, 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1 minutes. In some embodiments, the methods and systems as described herein can identify specific genomic regions in less than 10 minutes. In some embodiments, the methods and systems as described herein can identify specific genomic regions in less than 5 minutes.

The methods and systems disclosed herein can generate output data identifying one or more specific genomic regions (i.e., copy number variation, or one or more polymorphisms) more economically or using fewer reagents than high coverage whole genome sequencing alone. In some embodiments, the methods and systems as described herein can identify specific genomic regions with 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, or 90% lower financial charges to the customers or fewer reagents used for sequencing reactions. In some embodiments, low coverage whole genome sequencing data costs about, or less than about, 500, 450, 400, 350, 300, 250, 200, 150, or 100 U.S. dollars.

Samples

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more samples. The samples may be nucleic acid samples comprising one or more nucleic acid molecules. The methods and systems as disclosed herein may comprise, or comprise the use of, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100 or more samples. The methods and systems as described herein may comprise receiving and combining one or more data inputs from one or more samples. The one or more samples can be the same or different, or a combination thereof. A nucleic acid sample can be partitioned into a plurality of partitions in order to generate different sequencing data. In some embodiments, the first nucleic acid sample and the second nucleic acid sample are the same sample. The sample may be derived from a subject. The two or more samples may be derived from a single subject. The two or more samples may be derived from 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100 or more different subjects. The subject may be a mammal, reptile, amphibian, avian, or fish. The mammal may be a human, ape, orangutan, monkey, chimpanzee, cow, pig, horse, rodent, bird, reptile, dog, cat, or other animal. A reptile may be a lizard, snake, alligator, turtle, crocodile, or tortoise. An amphibian may be a toad, frog, newt, or salamander. Examples of avians include, but are not limited to, ducks, geese, penguins, ostriches, and owls. Examples of fish include, but are not limited to, catfish, eels, sharks, and swordfish. Preferably, the subject is a human. The subject may suffer from a disease or condition.

The two or more samples may be collected over 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or more time points. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more hour period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more day period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more week period. The time points may occur over a 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55, 60 or more month period.

The sample may be from a body fluid, cell, skin, tissue, organ, or combination thereof. The sample may be blood, plasma, a blood fraction, saliva, sputum, urine, semen, transvaginal fluid, cerebrospinal fluid, stool, a cell or a tissue biopsy. The sample may be from an adrenal gland, appendix, bladder, brain, ear, esophagus, eye, gall bladder, heart, kidney, large intestine, liver, lung, mouth, muscle, nose, pancreas, parathyroid gland, pineal gland, pituitary gland, skin, small intestine, spleen, stomach, thymus, thyroid gland, trachea, uterus, vermiform appendix, cornea, skin, heart valve, artery, or vein.

The samples may comprise one or more nucleic acid molecules. The nucleic acid molecule may be a DNA molecule, RNA molecule (e.g., mRNA, cRNA or miRNA), or DNA/RNA hybrid. Examples of DNA molecules include, but are not limited to, double-stranded DNA, single-stranded DNA, single-stranded DNA hairpins, cDNA, and genomic DNA. The nucleic acid may be an RNA molecule, such as a double-stranded RNA, single-stranded RNA, ncRNA, RNA hairpin, or mRNA. Examples of ncRNA include, but are not limited to, siRNA, miRNA, snoRNA, piRNA, tiRNA, PASR, TASR, aTASR, TSSa-RNA, snRNA, RE-RNA, uaRNA, x-ncRNA, hY RNA, usRNA, snaR, and vtRNA.

Nucleic Acid Samples

Methods and systems of the present disclosure can be easily applied to any type of nucleic acid sample. In some embodiments, the nucleic acid samples can be fragmented double stranded DNA including but not limited to, for example, free DNA isolated from plasma, serum, and/or urine; DNA from apoptotic cells and/or tissues; DNA fragmented enzymatically in vitro (for example, by DNase I and/or restriction endonuclease); and/or DNA fragmented by mechanical forces (hydro-shear, sonication, nebulization, etc.). Additional suitable methods and compositions of producing nucleic acid molecules comprising stem-loop oligonucleotides are further described in detail in U.S. Pat. No. 7,803,550, which is herein incorporated by reference in its entirety.

In other embodiments, methods and systems provided herein can be easily applied to any high molecular weight double stranded DNA including, for example, DNA isolated from tissues, cell culture, bodily fluids, animal tissue, plant, bacteria, fungi, viruses, etc.

Nucleic acid obtained from biological samples typically is fragmented to produce suitable fragments for analysis. Template nucleic acids may be fragmented or sheared to desired length, using a variety of mechanical, chemical and/or enzymatic methods. DNA may be randomly sheared via sonication, e.g., Covaris method, brief exposure to a DNase, or using a mixture of one or more restriction enzymes, or a transposase or nicking enzyme. RNA may be fragmented by brief exposure to an RNase, heat plus magnesium, or by shearing. The RNA may be converted to cDNA. If fragmentation is employed, the RNA may be converted to cDNA before or after fragmentation. In one embodiment, nucleic acid from a biological sample is fragmented by sonication. In another embodiment, nucleic acid is fragmented by a hydroshear instrument. Generally, individual nucleic acid template molecules can be from about 2 kb to about 40 kb. In a particular embodiment, nucleic acids are about 6 kb-10 kb fragments. Nucleic acid molecules may be single-stranded, double-stranded, or double-stranded with single-stranded regions (for example, stem- and loop-structures).

A biological sample as described herein may be homogenized or fractionated in the presence of a detergent or surfactant. The concentration of the detergent in the buffer may be about 0.05% to about 10.0%. The concentration of the detergent can be up to an amount where the detergent remains soluble in the solution. In one embodiment, the concentration of the detergent is between 0.1% and about 2%. The detergent, particularly a mild one that is nondenaturing, can act to solubilize the sample. Detergents may be ionic or nonionic. Examples of nonionic detergents include triton, such as the Triton® X series (Triton® X-100 t-Oct-C6H4-(OCH2-CH2)xOH, x=9-10, Triton® X-100R, Triton® X-114 x=7-8), octyl glucoside, polyoxyethylene(9)dodecyl ether, digitonin, IGEPAL® CA630 octylphenyl polyethylene glycol, n-octyl-beta-D-glucopyranoside (betaOG), n-dodecyl-beta, Tween® 20 polyethylene glycol sorbitan monolaurate, Tween® 80 polyethylene glycol sorbitan monooleate, polidocanol, n-dodecyl beta-D-maltoside (DDM), NP-40 nonylphenyl polyethylene glycol, C12E8 (octaethylene glycol n-dodecyl monoether), hexaethyleneglycol mono-n-tetradecyl ether (C14EO6), octyl-beta-thioglucopyranoside (octyl thioglucoside, OTG), Emulgen, and polyoxyethylene 10 lauryl ether (C12E10). Examples of ionic detergents (anionic or cationic) include deoxycholate, sodium dodecyl sulfate (SDS), N-lauroylsarcosine, and cetyltrimethylammonium bromide (CTAB). A zwitterionic reagent may also be used in the purification schemes of the present disclosure, such as Chaps, zwitterion 3-14, and 3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate. It is contemplated also that urea may be added with or without another detergent or surfactant.

Lysis or homogenization solutions may further contain other agents, such as reducing agents. Examples of such reducing agents include dithiothreitol (DTT), β-mercaptoethanol, DTE, GSH, cysteine, cysteamine, tricarboxyethyl phosphine (TCEP), or salts of sulfurous acid.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more subsets of nucleic acid molecules. The subsets of nucleic acid molecules may be derived from a nucleic acid sample. The subsets of nucleic acid molecules may be derived from the same nucleic acid sample. Alternatively, or additionally, the subsets of nucleic acid molecules are derived from two or more different nucleic acid samples. Two or more subsets of nucleic acid molecules may be differentiated by their nucleic acid content. The one or more subsets of nucleic acid molecules may comprise one or more nucleic acid molecules or a variant or derivative thereof. For example, the two or more subsets of nucleic acid molecules may comprise nucleic acids comprising different GC content, nucleic acid size, genomic regions, genomic region features, eluted nucleic acid molecules, hybridized nucleic acid molecules, non-hybridized nucleic acid molecules, amplified nucleic acid molecules, non-amplified nucleic acid molecules, supernatant-derived nucleic acid molecules, eluant-derived nucleic acid molecules, labeled nucleic acid molecules, non-labeled nucleic acid molecules, capture probe hybridized nucleic acid molecules, capture probe free nucleic acid molecules, bead bound nucleic acid molecules, bead free nucleic acid molecules, or a combination thereof. The two or more subsets of nucleic acid molecules may be differentiated by GC content, nucleic acid size, genomic regions, capture probes, beads, labels, or a combination thereof.
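As a purely illustrative sketch of differentiating subsets by GC content, the following Python snippet partitions sequences in silico into a high-GC and a low-GC subset. The function names and the 0.5 threshold are assumptions made for this example only; the disclosure does not specify a threshold or a computational partitioning step.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def split_by_gc(seqs, threshold=0.5):
    """Partition sequences into (high_gc, low_gc) subsets.

    `threshold` is a hypothetical cutoff chosen for illustration.
    Sequences at or above the cutoff go to the high-GC subset.
    """
    high_gc, low_gc = [], []
    for seq in seqs:
        (high_gc if gc_fraction(seq) >= threshold else low_gc).append(seq)
    return high_gc, low_gc


high, low = split_by_gc(["GGCC", "ATAT", "GCAT"])
```

Here `high` holds the sequences with at least 50% GC and `low` holds the remainder. A wet-lab separation would of course rely on biochemical properties rather than on known sequence.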

The methods and systems as disclosed herein may comprise, or comprise the use of, combining two or more subsets of nucleic acid molecules to produce a combined subset of nucleic acid molecules. The combined subsets of nucleic acid molecules may be derived from a nucleic acid sample. The combined subsets of nucleic acid molecules may be derived from the same nucleic acid sample. Alternatively, or additionally, the combined subsets of nucleic acid molecules are derived from two or more different nucleic acid samples. Two or more combined subsets of nucleic acid molecules may be differentiated by their nucleic acid content. The one or more combined subsets of nucleic acid molecules may comprise one or more nucleic acid molecules or a variant or derivative thereof. For example, the two or more combined subsets of nucleic acid molecules may comprise nucleic acids comprising different GC content, nucleic acid size, genomic regions, genomic region features, eluted nucleic acid molecules, hybridized nucleic acid molecules, non-hybridized nucleic acid molecules, amplified nucleic acid molecules, non-amplified nucleic acid molecules, supernatant-derived nucleic acid molecules, eluant-derived nucleic acid molecules, labeled nucleic acid molecules, non-labeled nucleic acid molecules, capture probe hybridized nucleic acid molecules, capture probe free nucleic acid molecules, bead bound nucleic acid molecules, bead free nucleic acid molecules, or a combination thereof. The two or more combined subsets of nucleic acid molecules may be differentiated by GC content, nucleic acid size, genomic regions, capture probes, beads, labels, or a combination thereof.

Subsets of nucleic acid molecules may comprise one or more genomic regions as disclosed herein. Subsets of nucleic acid molecules may comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100 or more genomic regions. The one or more genomic regions may be identical, similar, different, or a combination thereof.

Subsets of nucleic acid molecules may comprise one or more genomic region features as disclosed herein. Subsets of nucleic acid molecules may comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100 or more genomic region features. The one or more genomic region features may be identical, similar, different, or a combination thereof.

Subsets of nucleic acid molecules may comprise nucleic acid molecules of different sizes. The length of a nucleic acid molecule in a subset of nucleic acid molecules may be referred to as the size of the nucleic acid molecule. The average length of the nucleic acid molecules in a subset of nucleic acid molecules may be referred to as the mean size of nucleic acid molecules. As used herein, the terms “size of a nucleic acid molecule”, “mean size of nucleic acid molecules”, “molecular size” and “mean molecular size” may be used interchangeably. The size of a nucleic acid molecule may be used to differentiate two or more subsets of nucleic acid molecules. The difference in the mean size of nucleic acid molecules in a subset of nucleic acid molecules and the mean size of nucleic acid molecules in another subset of nucleic acid molecules may be used to differentiate the two subsets of nucleic acid molecules. The mean size of nucleic acid molecules in one subset of nucleic acid molecules may be greater than the mean size of nucleic acid molecules in at least one other subset of nucleic acid molecules. The mean size of nucleic acid molecules in one subset of nucleic acid molecules may be less than the mean size of nucleic acid molecules in at least one other subset of nucleic acid molecules. The difference in mean molecular size between two or more subsets of nucleic acid molecules may be at least about 50; 75; 100; 125; 150; 175; 200; 225; 250; 275; 300; 350; 400; 450; 500; 550; 600; 650; 700; 750; 800; 850; 900; 950; 1,000; 1,100; 1,200; 1,300; 1,400; 1,500; 1,600; 1,700; 1,800; 1,900; 2,000; 3,000; 4,000; 5,000; 6,000; 7,000; 8,000; 9,000; 10,000; 15,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; 100,000 or more bases or base pairs. In some aspects of the disclosure, the difference in mean molecular size between two or more subsets of nucleic acid molecules is at least about 200 bases or base pairs. Alternatively, the difference in mean molecular size between two or more subsets of nucleic acid molecules is at least about 300 bases or base pairs.
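The size-based differentiation described above can be sketched in a few lines of Python. The function names and example fragment lengths are hypothetical; the snippet only shows how a mean-size difference could be compared against a threshold such as 200 or 300 bases.

```python
from statistics import mean


def mean_size(lengths):
    """Mean molecular size of a subset, given fragment lengths in bases."""
    return mean(lengths)


def differ_by_at_least(subset_a, subset_b, min_diff=200):
    """True if the mean sizes of two subsets differ by at least `min_diff` bases.

    `min_diff` defaults to 200, one of the thresholds named in the text.
    """
    return abs(mean_size(subset_a) - mean_size(subset_b)) >= min_diff


# Hypothetical fragment lengths (in bases) for two subsets.
short_fragments = [150, 200, 250]
long_fragments = [450, 500, 550]
```

With these example values the mean sizes are 200 and 500 bases, so the subsets differ by 300 bases and satisfy both the 200-base and the 300-base thresholds.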

Subsets of nucleic acid molecules may comprise nucleic acid molecules of different sequencing sizes. The length of a nucleic acid molecule in a subset of nucleic acid molecules to be sequenced may be referred to as the sequencing size of the nucleic acid molecule. The average length of the nucleic acid molecules in a subset of nucleic acid molecules may be referred to as the mean sequencing size of nucleic acid molecules. As used herein, the terms “sequencing size of a nucleic acid molecule”, “mean sequencing size of nucleic acid molecules”, “molecular sequencing size” and “mean molecular sequencing size” may be used interchangeably. The mean molecular sequencing size of one or more subsets of nucleic acid molecules may be at least about 50; 75; 100; 125; 150; 175; 200; 225; 250; 275; 300; 350; 400; 450; 500; 550; 600; 650; 700; 750; 800; 850; 900; 950; 1,000; 1,100; 1,200; 1,300; 1,400; 1,500; 1,600; 1,700; 1,800; 1,900; 2,000; 3,000; 4,000; 5,000; 6,000; 7,000; 8,000; 9,000; 10,000; 15,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; 100,000 or more bases or base pairs. The sequencing size of a nucleic acid molecule may be used to differentiate two or more subsets of nucleic acid molecules. The difference in the mean sequencing size of nucleic acid molecules in a subset of nucleic acid molecules and the mean sequencing size of nucleic acid molecules in another subset of nucleic acid molecules may be used to differentiate the two subsets of nucleic acid molecules. The mean sequencing size of nucleic acid molecules in one subset of nucleic acid molecules may be greater than the mean sequencing size of nucleic acid molecules in at least one other subset of nucleic acid molecules. The mean sequencing size of nucleic acid molecules in one subset of nucleic acid molecules may be less than the mean sequencing size of nucleic acid molecules in at least one other subset of nucleic acid molecules. The difference in mean molecular sequencing size between two or more subsets of nucleic acid molecules may be at least about 50; 75; 100; 125; 150; 175; 200; 225; 250; 275; 300; 350; 400; 450; 500; 550; 600; 650; 700; 750; 800; 850; 900; 950; 1,000; 1,100; 1,200; 1,300; 1,400; 1,500; 1,600; 1,700; 1,800; 1,900; 2,000; 3,000; 4,000; 5,000; 6,000; 7,000; 8,000; 9,000; 10,000; 15,000; 20,000; 30,000; 40,000; 50,000; 60,000; 70,000; 80,000; 90,000; 100,000 or more bases or base pairs. In some aspects of the disclosure, the difference in mean molecular sequencing size between two or more subsets of nucleic acid molecules is at least about 200 bases or base pairs. Alternatively, the difference in mean molecular sequencing size between two or more subsets of nucleic acid molecules is at least about 300 bases or base pairs.

Two or more subsets of nucleic acid molecules may be at least partially complementary. For example, a first subset of nucleic acid molecules may comprise nucleic acid molecules comprising at least a first portion of the genome and a second subset of nucleic acid molecules may comprise nucleic acid molecules comprising at least a second portion of the genome, wherein the first and second portion of the genome differ by one or more nucleic acid molecules. Thus, the first subset and the second subset are at least partially complementary. The complementarity of two or more subsets of nucleic acid molecules may be at least about 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or more. As used herein, the term “complementarity of two or more subsets of nucleic acid molecules” generally refers to genomic content of the two or more subsets and the extent to which the two or more subsets encompass the content of one or more genomic regions. For example, if a first subset of nucleic acid molecules comprises 50% of total high GC exomes and a second subset of nucleic acid molecules comprises 50% of the total low GC exomes, then the complementarity of the two subsets of nucleic acid molecules in reference to an entire exome is 50%. In another example, if a first subset of nucleic acid molecules comprises 100% of the total bead bound nucleic acid molecules and the second subset of nucleic acid molecules comprises 100% of the total bead free nucleic acid molecules, then the complementarity of the two subsets in reference to the total nucleic acid molecules is 100%.
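One possible reading of the complementarity definition above is the fraction of a reference set of regions that is encompassed by the subsets taken together. The Python sketch below reproduces the two worked examples under that assumption. The region identifiers, the `complementarity` function, and the set-based model are hypothetical and are not taken from the disclosure.

```python
def complementarity(subsets, reference):
    """Fraction of `reference` covered by the union of `subsets`.

    Each subset and the reference are modeled as sets of region or
    molecule identifiers (an assumption made for illustration).
    """
    if not reference:
        raise ValueError("reference must not be empty")
    covered = set().union(*subsets) & reference
    return len(covered) / len(reference)


# Example 1: an exome modeled as two high-GC and two low-GC regions.
exome = {"highGC_1", "highGC_2", "lowGC_1", "lowGC_2"}
first = {"highGC_1"}   # 50% of the high-GC regions
second = {"lowGC_1"}   # 50% of the low-GC regions
exome_complementarity = complementarity([first, second], exome)

# Example 2: all bead bound plus all bead free molecules.
bead_bound = {"m1", "m2"}
bead_free = {"m3"}
total = bead_bound | bead_free
bead_complementarity = complementarity([bead_bound, bead_free], total)
```

Under this model the first example yields 0.5 (50%) and the second yields 1.0 (100%), matching the figures stated in the text.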

Subsets of nucleic acid molecules may comprise bead bound nucleic acid molecules. Two or more subsets of nucleic acid molecules may be differentiated into bead bound nucleic acid molecules and bead free nucleic acid molecules. For example, a first subset of nucleic acid molecules may comprise one or more bead bound nucleic acid molecules and a second subset of nucleic acid molecules may comprise bead free nucleic acid molecules. Bead free nucleic acid molecules may refer to nucleic acid molecules that are not bound to one or more beads. Bead free nucleic acid molecules may refer to nucleic acid molecules that have been eluted from one or more beads. For example, the nucleic acid molecule from a bead bound nucleic acid molecule may be eluted to produce a bead free nucleic acid molecule.

Subsets of nucleic acid molecules may comprise capture probe hybridized nucleic acid molecules. Two or more subsets of nucleic acid molecules may be differentiated into capture probe hybridized nucleic acid molecules and capture probe free nucleic acid molecules. For example, a first subset of nucleic acid molecules may comprise one or more capture probe hybridized nucleic acid molecules and a second subset of nucleic acid molecules may comprise capture probe free nucleic acid molecules. Capture probe free nucleic acid molecules may refer to nucleic acid molecules that are not hybridized to one or more capture probes. Capture probe free nucleic acid molecules may refer to nucleic acid molecules that are dehybridized from one or more capture probes. For example, the capture probe from a capture probe hybridized nucleic acid molecule may be removed to produce a capture probe free nucleic acid molecule.

Capture probes may hybridize to one or more nucleic acid molecules in a sample or in a subset of nucleic acid molecules. Capture probes may hybridize to one or more genomic regions. Capture probes may hybridize to one or more genomic regions within, around, near, or spanning one or more genes, exons, introns, UTRs, or a combination thereof. Capture probes may hybridize to one or more genomic regions spanning one or more genes, exons, introns, UTRs, or a combination thereof. Capture probes may hybridize to one or more known inDels. Capture probes may hybridize to one or more known structural variants.

Subsets of nucleic acid molecules may comprise labeled nucleic acid molecules. Two or more subsets of nucleic acid molecules may be differentiated into labeled nucleic acid molecules and non-labeled nucleic acid molecules. For example, a first subset of nucleic acid molecules may comprise one or more labeled nucleic acid molecules and a second subset of nucleic acid molecules may comprise non-labeled nucleic acid molecules. Non-labeled nucleic acid molecules may refer to nucleic acid molecules that are not attached to one or more labels. Non-labeled nucleic acid molecules may refer to nucleic acid molecules that have been detached from one or more labels. For example, the label from a labeled nucleic acid molecule may be removed to produce a non-labeled nucleic acid molecule.

The methods and systems as disclosed herein may comprise, or comprise the use of, one or more labels. The one or more labels may be attached to one or more capture probes, nucleic acid molecules, beads, primers, or a combination thereof. Examples of labels include, but are not limited to, detectable labels, such as radioisotopes, fluorophores, chemiluminophores, chromophores, lumiphores, enzymes, colloidal particles, fluorescent microparticles, and quantum dots, as well as antigens, antibodies, haptens, avidin/streptavidin, biotin, enzyme cofactors/substrates, one or more members of a quenching system, chromogens, magnetic particles, materials exhibiting nonlinear optics, semiconductor nanocrystals, metal nanoparticles, aptamers, and one or more members of a binding pair.

The one or more subsets of nucleic acid molecules may be subjected to one or more assays. The one or more subsets of nucleic acid molecules may be subjected to one or more assays based on their biochemical features. The one or more subsets of nucleic acid molecules may be subjected to one or more assays based on their genomic region features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more assays. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more assays based on their biochemical features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more assays based on their genomic region features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more identical assays. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more identical assays based on their biochemical features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more identical assays based on their genomic region features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more similar assays. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more similar assays based on their biochemical features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more similar assays based on their genomic region features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more different assays. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more different assays based on their biochemical features. The one or more subsets of nucleic acid molecules may be subjected to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more different assays based on their genomic region features. The two or more subsets of nucleic acid molecules may be subjected to one or more identical processing steps based on their biochemical features. The two or more subsets of nucleic acid molecules may be subjected to one or more identical processing steps based on their genomic region features. The two or more subsets of nucleic acid molecules may be subjected to one or more similar processing steps based on their biochemical features. The two or more subsets of nucleic acid molecules may be subjected to one or more similar processing steps based on their genomic region features. The two or more subsets of nucleic acid molecules may be subjected to one or more different processing steps based on their biochemical features. The two or more subsets of nucleic acid molecules may be subjected to one or more different processing steps based on their genomic region features.

The methods and systems as disclosed herein may comprise, or comprise the use of, producing two or more subsets of nucleic acid molecules. The two or more subsets of nucleic acid molecules may be separated fluidically, separated into two or more containers, separated into two or more locations, or a combination thereof. For example, a first subset of nucleic acid molecules and a second subset of nucleic acid molecules are fluidically separated. In another example, a first subset of nucleic acid molecules is in a first container and a second subset of nucleic acid molecules is in a second container. In yet another example, a first subset of nucleic acid molecules and a second subset of nucleic acid molecules are assigned to two or more locations on a first container, and a third subset of nucleic acid molecules is in a second container.

Genomic Regions

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions. The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more sets of genomic regions. The one or more genomic regions may comprise one or more genomic region features. The genomic region features may comprise an entire genome or a portion thereof. The genomic region features may comprise an entire exome or a portion thereof. The genomic region features may comprise one or more sets of genes. The genomic region features may comprise one or more genes. The genomic region features may comprise one or more sets of regulatory elements. The genomic region features may comprise one or more regulatory elements. The genomic region features may comprise a set of polymorphisms. The genomic region features may comprise one or more polymorphisms. The genomic region feature may relate to the GC content, complexity, and/or mappability of one or more nucleic acid molecules. The genomic region features may comprise one or more simple tandem repeats (STRs), unstable expanding repeats, segmental duplications, single and paired read degenerative mapping scores, GRCh37 patches, or a combination thereof. The genomic region features may comprise one or more low mean coverage regions from whole genome sequencing (WGS), zero mean coverage regions from WGS, validated compressions, or a combination thereof. The genomic region features may comprise one or more alternate or non-reference sequences. The genomic region features may comprise one or more gene phasing and reassembly genes. In some aspects of the disclosure, the one or more genomic region features are not mutually exclusive. For example, a genomic region feature comprising an entire genome or a portion thereof can overlap with an additional genomic region feature such as an entire exome or a portion thereof, one or more genes, one or more regulatory elements, etc. Alternatively, the one or more genomic region features are mutually exclusive. For example, a genomic region comprising the noncoding portion of an entire genome would not overlap with a genomic region feature such as an exome or portion thereof or the coding portion of a gene. Alternatively, or additionally, the one or more genomic region features are partially exclusive or partially inclusive. For example, a genomic region comprising an entire exome or a portion thereof can partially overlap with a genomic region comprising an exon portion of a gene. However, the genomic region comprising the entire exome or portion thereof would not overlap with the genomic region comprising the intron portion of the gene. Thus, a genomic region feature comprising a gene or portion thereof may partially exclude and/or partially include a genomic region feature comprising an entire exome or portion thereof.
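The three relationships described above (overlapping, mutually exclusive, and partially inclusive) can be illustrated with a toy model in which each genomic region feature is a set of base positions. The coordinates, the `relationship` function, and its return labels are hypothetical and serve only to make the distinctions concrete.

```python
def relationship(a: set, b: set) -> str:
    """Classify how two genomic region features (sets of positions) relate."""
    if not a & b:
        return "mutually exclusive"
    if a <= b or b <= a:
        return "nested"
    return "partially overlapping"


# Toy 100-base "genome" containing one gene with two exons and one intron,
# plus one additional exon that belongs to some other gene.
genome = set(range(0, 100))
gene_exons = set(range(10, 20)) | set(range(30, 40))
gene_intron = set(range(20, 30))
gene = gene_exons | gene_intron
exome = gene_exons | set(range(60, 70))
noncoding = genome - exome

genome_vs_exome = relationship(genome, exome)
noncoding_vs_exome = relationship(noncoding, exome)
exome_vs_gene = relationship(exome, gene)
exome_vs_intron = relationship(exome, gene_intron)
```

In this model the exome is nested within the genome, the noncoding portion and the exome are mutually exclusive, and the exome partially overlaps the gene because it shares the gene's exons but not its intron.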

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising an entire genome or portion thereof. The entire genome or portion thereof may comprise one or more coding portions of the genome, one or more noncoding portions of the genome, or a combination thereof. The coding portion of the genome may comprise one or more coding portions of a gene encoding for one or more proteins. The one or more coding portions of the genome may comprise an entire exome or a portion thereof. Alternatively, or additionally, the one or more coding portions of the genome may comprise one or more exons. The one or more noncoding portions of the genome may comprise one or more noncoding molecules or a portion thereof. The noncoding molecules may comprise one or more noncoding RNA, one or more regulatory elements, one or more introns, one or more pseudogenes, one or more repeat sequences, one or more transposons, one or more viral elements, one or more telomeres, a portion thereof, or a combination thereof. The noncoding RNAs may be functional RNA molecules that are not translated into protein. Examples of noncoding RNAs include, but are not limited to, ribosomal RNA, transfer RNA, piwi-interacting RNA, microRNA, siRNA, shRNA, snoRNA, sncRNA, and lncRNA. Pseudogenes may be related to known genes and are typically no longer expressed. Repeat sequences may comprise one or more tandem repeats, one or more interspersed repeats, or a combination thereof. Tandem repeats may comprise one or more satellite DNA, one or more minisatellites, one or more microsatellites, or a combination thereof. Interspersed repeats may comprise one or more transposons. Transposons may be mobile genetic elements. Mobile genetic elements are often able to change their position within the genome. Transposons may be classified as class I transposable elements (class I TEs) or class II transposable elements (class II TEs). Class I TEs (e.g., retrotransposons) may often copy themselves in two stages, first from DNA to RNA by transcription, then from RNA back to DNA by reverse transcription. The DNA copy may then be inserted into the genome in a new position. Class I TEs may comprise one or more long terminal repeats (LTRs), one or more long interspersed nuclear elements (LINEs), one or more short interspersed nuclear elements (SINEs), or a combination thereof. Examples of LTRs include, but are not limited to, human endogenous retroviruses (HERVs), medium reiterated repeats 4 (MER4), and retrotransposon. Examples of LINEs include, but are not limited to, LINE1 and LINE2. SINEs may comprise one or more Alu sequences, one or more mammalian-wide interspersed repeat (MIR), or a combination thereof. Class II TEs (e.g., DNA transposons) often do not involve an RNA intermediate. The DNA transposon is often cut from one site and inserted into another site in the genome. Alternatively, the DNA transposon is replicated and inserted into the genome in a new position. Examples of DNA transposons include, but are not limited to, MER1, MER2, and mariners. Viral elements may comprise one or more endogenous retrovirus sequences. Telomeres are often regions of repetitive DNA at the end of a chromosome.

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising an entire exome or portion thereof. The exome is often the part of the genome formed by exons. The exome may further comprise untranslated regions (UTRs), splice sites, and/or intronic regions. The entire exome or portion thereof may comprise one or more exons of a protein-coding gene. The entire exome or portion thereof may comprise one or more untranslated regions (UTRs), splice sites, and introns.

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising a gene or portion thereof. Typically, a gene comprises stretches of nucleic acids that code for a polypeptide or a functional RNA. A gene may comprise one or more exons, one or more introns, one or more untranslated regions (UTRs), or a combination thereof. Exons are often coding sections of a gene, transcribed into a precursor mRNA sequence, and within the final mature RNA product of the gene. Introns are often noncoding sections of a gene, transcribed into a precursor mRNA sequence, and removed by RNA splicing. UTRs may refer to sections on each side of a coding sequence on a strand of mRNA. A UTR located on the 5′ side of a coding sequence may be called the 5′ UTR (or leader sequence). A UTR located on the 3′ side of a coding sequence may be called the 3′ UTR (or trailer sequence). The UTR may contain one or more elements for controlling gene expression. Elements, such as regulatory elements, may be located in the 5′ UTR. Regulatory sequences, such as a polyadenylation signal, binding sites for proteins, and binding sites for miRNAs, may be located in the 3′ UTR. Binding sites for proteins located in the 3′ UTR may include, but are not limited to, selenocysteine insertion sequence (SECIS) elements and AU-rich elements (AREs). SECIS elements may direct a ribosome to translate the codon UGA as selenocysteine rather than as a stop codon. AREs are often stretches consisting primarily of adenine and uracil nucleotides, which may affect the stability of an mRNA.

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising a set of genes. The sets of genes may include, but are not limited to, Mendel DB Genes, Human Gene Mutation Database (HGMD) Genes, Cancer Gene Census Genes, Online Mendelian Inheritance in Man (OMIM) Mendelian Genes, HGMD Mendelian Genes, and human leukocyte antigen (HLA) Genes. The set of genes may have one or more known Mendelian traits, one or more known disease traits, one or more known drug traits, one or more known biomedically interpretable variants, or a combination thereof. A Mendelian trait may be controlled by a single locus and may show a Mendelian inheritance pattern. A set of genes with known Mendelian traits may comprise one or more genes encoding Mendelian traits including, but not limited to, ability to taste phenylthiocarbamide (dominant), ability to smell (bitter almond-like) hydrogen cyanide (recessive), albinism (recessive), brachydactyly (shortness of fingers and toes), and wet (dominant) or dry (recessive) earwax. A disease trait may cause or increase risk of disease and may be inherited in a Mendelian or complex pattern. A set of genes with known disease traits may comprise one or more genes encoding disease traits including, but not limited to, Cystic Fibrosis, Hemophilia, and Lynch Syndrome. A drug trait may alter metabolism, optimal dose, adverse reactions and side effects of one or more drugs or family of drugs. A set of genes with known drug traits may comprise one or more genes encoding drug traits including, but not limited to, CYP2D6, UGT1A1 and ADRB1. A biomedically interpretable variant may be a polymorphism in a gene that is associated with a disease or indication.
A set of genes with known biomedically interpretable variants may comprise one or more genes encoding biomedically interpretable variants including, but not limited to, cystic fibrosis (CF) mutations, muscular dystrophy mutations, p53 mutations, Rb mutations, cell cycle regulators, receptors, and kinases. Alternatively, or additionally, a set of genes with known biomedically interpretable variants may comprise one or more genes associated with Huntington's disease, cancer, cystic fibrosis, or muscular dystrophy (e.g., Duchenne muscular dystrophy).

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising a regulatory element or a portion thereof. Regulatory elements may be cis-regulatory elements or trans-regulatory elements. Cis-regulatory elements may be sequences that control transcription of a nearby gene. Cis-regulatory elements may be located in the 5′ or 3′ untranslated regions (UTRs) or within introns. Trans-regulatory elements may control transcription of a distant gene. Regulatory elements may comprise one or more promoters, one or more enhancers, or a combination thereof. Promoters may facilitate transcription of a particular gene and may be found upstream of a coding region. Enhancers may exert distant effects on the transcription level of a gene.

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising a polymorphism or a portion thereof. Generally, a polymorphism refers to a mutation in a genotype. A polymorphism may comprise one or more base changes, an insertion, a repeat, or a deletion of one or more bases. Copy number variants (CNVs), transversions, and other rearrangements are also forms of genetic variation. Polymorphic markers include restriction fragment length polymorphisms, variable number of tandem repeats (VNTRs), hypervariable regions, minisatellites, dinucleotide repeats, trinucleotide repeats, tetranucleotide repeats, simple sequence repeats, and insertion elements such as Alu. The allelic form occurring most frequently in a selected population is sometimes referred to as the wildtype form. Diploid organisms may be homozygous or heterozygous for allelic forms. A diallelic polymorphism has two forms. A triallelic polymorphism has three forms. Single nucleotide polymorphisms (SNPs) are a form of polymorphism. In some aspects of the disclosure, one or more polymorphisms comprise one or more single nucleotide variations, indels, small insertions, small deletions, structural variant junctions, variable length tandem repeats, flanking sequences, or a combination thereof. The one or more polymorphisms may be located within a coding and/or non-coding region. The one or more polymorphisms may be located within, around, or near a gene, exon, intron, splice site, untranslated region, or a combination thereof. The one or more polymorphisms may span at least a portion of a gene, exon, intron, or untranslated region.
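By way of non-limiting illustration, the polymorphism types named above (single nucleotide variations, small insertions, small deletions) and the homozygous/heterozygous distinction may be sketched as follows. The function names and the reference/alternate allele string representation are hypothetical and are assumed here only for illustration; they are not prescribed by the disclosure.

```python
def classify_variant(ref, alt):
    """Label a variant from its reference and alternate allele strings.

    Assumes alleles are given as plain base strings (an illustrative
    convention, not one required by the disclosure).
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "single nucleotide variation"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    # Same length, more than one base, and different sequence.
    return "multi-base substitution"


def zygosity(allele_1, allele_2):
    """Return 'homozygous' if both alleles of a diploid genotype match."""
    return "homozygous" if allele_1 == allele_2 else "heterozygous"


if __name__ == "__main__":
    print(classify_variant("A", "G"))
    print(classify_variant("A", "ATT"))
    print(zygosity("A", "G"))
```

In this sketch, larger events such as copy number variants or structural variant junctions are not handled; they would typically require additional information beyond a pair of allele strings.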

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature comprising one or more simple tandem repeats (STRs), unstable expanding repeats, segmental duplications, single and paired read degenerative mapping scores, GRCh37 patches, or a combination thereof. The one or more STRs may comprise one or more homopolymers, one or more dinucleotide repeats, one or more trinucleotide repeats, or a combination thereof. The one or more homopolymers may be about 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more bases or base pairs. The dinucleotide repeats and/or trinucleotide repeats may be about 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 35, 40, 45, 50 or more bases or base pairs. The single and paired read degenerative mapping scores may be based on or derived from alignability of 100-mers by GEM from ENCODE/CRG (Guigo), alignability of 75-mers by GEM from ENCODE/CRG (Guigo), 100 base pair box car average for signal mappability, max of locus and possible pairs for paired read score, or a combination thereof. The genomic region features may comprise one or more low mean coverage regions from whole genome sequencing (WGS), zero mean coverage regions from WGS, validated compressions, or a combination thereof. The low mean coverage regions from WGS may comprise regions generated from Illumina v3 chemistry, regions below the first percentile of a Poisson distribution based on mean coverage, or a combination thereof. The zero mean coverage regions from WGS may comprise regions generated from Illumina v3 chemistry. The validated compressions may comprise regions of high mapped depth, regions with two or more observed haplotypes, regions expected to be missing repeats in a reference, or a combination thereof.
The genomic region features may comprise one or more alternate or non-reference sequences. The one or more alternate or non-reference sequences may comprise known structural variant junctions, known insertions, known deletions, alternate haplotypes, or a combination thereof. The genomic region features may comprise one or more gene phasing and reassembly genes. Examples of phasing and reassembly genes include, but are not limited to, one or more major histocompatibility complexes, blood typing genes, and the amylase gene family. The one or more major histocompatibility complexes may comprise one or more HLA Class I, HLA Class II, or a combination thereof. The one or more HLA class I may comprise HLA-A, HLA-B, HLA-C, or a combination thereof. The one or more HLA class II may comprise HLA-DP, HLA-DM, HLA-DOA, HLA-DOB, HLA-DQ, HLA-DR, or a combination thereof. The blood typing genes may comprise ABO, RHD, RHCE, or a combination thereof.
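By way of non-limiting illustration, two of the genomic region features described above may be sketched as follows: locating homopolymers of about 7 or more bases, and computing the depth that marks the first percentile of a Poisson distribution parameterized by mean coverage. The function names, the default parameter values, and the choice to ignore non-ACGT characters are assumptions made for this sketch only.

```python
import math
from itertools import groupby


def find_homopolymers(seq, min_len=7):
    """Return (start, base, length) for each single-base run >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        # Runs of ambiguous characters (e.g., N) are skipped by assumption.
        if length >= min_len and base in "ACGT":
            runs.append((pos, base, length))
        pos += length
    return runs


def poisson_lower_quantile(mean_coverage, q=0.01):
    """Smallest depth k with P(X <= k) >= q for X ~ Poisson(mean_coverage).

    Each probability is evaluated in log space so that large mean
    coverages do not underflow.
    """
    if mean_coverage <= 0 or not 0 < q < 1:
        raise ValueError("mean_coverage must be > 0 and q must be in (0, 1)")
    log_mean = math.log(mean_coverage)
    cdf = 0.0
    k = 0
    while True:
        cdf += math.exp(-mean_coverage + k * log_mean - math.lgamma(k + 1))
        if cdf >= q:
            return k
        k += 1


if __name__ == "__main__":
    print(find_homopolymers("ACGT" + "A" * 8 + "CGT"))
    print(poisson_lower_quantile(30.0))
```

Under these assumptions, a region whose observed depth falls below the value returned by `poisson_lower_quantile` could be flagged as a low mean coverage region.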

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature related to the GC content of one or more nucleic acid molecules. The GC content may refer to the GC content of a nucleic acid molecule. Alternatively, the GC content may refer to the GC content of one or more nucleic acid molecules and may be referred to as the mean GC content. As used herein, the terms “GC content” and “mean GC content” may be used interchangeably. The GC content of a genomic region may be a high GC content. Typically, a high GC content refers to a GC content of greater than or equal to about 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, or more. In some aspects of the disclosure, a high GC content may refer to a GC content of greater than or equal to about 70%. The GC content of a genomic region may be a low GC content. Typically, a low GC content refers to a GC content of less than or equal to about 65%, 60%, 55%, 50%, 45%, 40%, 35%, 30%, 25%, 20%, 15%, 10%, 5%, 2%, or less.
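By way of non-limiting illustration, GC content and mean GC content as described above may be computed as in the following sketch. The function names are hypothetical. The 70% high-GC threshold is taken from the text above, while the 40% low-GC default is merely one of the listed values chosen here as an assumption.

```python
def gc_content(seq):
    """Fraction of G and C bases among the A/C/G/T bases of one molecule."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (seq.count("G") + seq.count("C")) / called


def mean_gc_content(seqs):
    """Mean GC content over one or more nucleic acid molecules."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("at least one sequence is required")
    return sum(gc_content(s) for s in seqs) / len(seqs)


def classify_gc(fraction, high=0.70, low=0.40):
    """Label a GC fraction as 'high', 'low', or 'intermediate'.

    The 'low' default is an illustrative assumption.
    """
    if fraction >= high:
        return "high"
    if fraction <= low:
        return "low"
    return "intermediate"


if __name__ == "__main__":
    reads = ["GGCCGGCA", "GCGCGCGC"]
    print(mean_gc_content(reads), classify_gc(mean_gc_content(reads)))
```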

The difference in GC content may be used to differentiate two or more genomic regions or two or more subsets of nucleic acid molecules. The difference in GC content may refer to the difference in GC content of one nucleic acid molecule and another nucleic acid molecule. Alternatively, the difference in GC content may refer to the difference in mean GC content of two or more nucleic acid molecules in a genomic region from the mean GC content of two or more nucleic acid molecules in another genomic region. In some aspects of the disclosure, the difference in GC content refers to the difference in mean GC content of two or more nucleic acid molecules in a subset of nucleic acid molecules from the mean GC content of two or more nucleic acid molecules in another subset of nucleic acid molecules. The difference in GC content may be about 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, or more. In some aspects of the disclosure, the difference in GC content is at least about 5%. The difference in GC content may be at least about 10%.
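By way of non-limiting illustration, the difference in mean GC content between two subsets of nucleic acid molecules may be evaluated as in the following sketch. The function names are hypothetical, and the default 5% threshold corresponds to the "at least about 5%" aspect described above.

```python
def gc_fraction(seq):
    """Fraction of G and C bases among the A/C/G/T bases of a sequence."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (seq.count("G") + seq.count("C")) / called


def mean_gc(seqs):
    """Mean GC fraction of a non-empty collection of sequences."""
    seqs = list(seqs)
    if not seqs:
        raise ValueError("at least one sequence is required")
    return sum(gc_fraction(s) for s in seqs) / len(seqs)


def gc_difference(subset_a, subset_b):
    """Absolute difference in mean GC fraction between two subsets."""
    return abs(mean_gc(subset_a) - mean_gc(subset_b))


def differ_by_gc(subset_a, subset_b, min_difference=0.05):
    """True if the two subsets differ in mean GC by at least min_difference."""
    return gc_difference(subset_a, subset_b) >= min_difference


if __name__ == "__main__":
    high_gc = ["GGCC", "GGCA"]
    low_gc = ["ATAT", "ATGC"]
    print(gc_difference(high_gc, low_gc), differ_by_gc(high_gc, low_gc))
```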

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature related to the complexity of one or more nucleic acid molecules. The complexity of a nucleic acid molecule may refer to the randomness of a nucleotide sequence. Low complexity may refer to patterns, repeats and/or depletion of one or more species of nucleotide in the sequence.
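By way of non-limiting illustration, one possible way to quantify the randomness of a nucleotide sequence is the Shannon entropy of its k-mer composition. The disclosure does not prescribe a particular complexity measure; the use of Shannon entropy, the function name, and the parameter k are assumptions made for this sketch only.

```python
import math
from collections import Counter


def shannon_entropy(seq, k=1):
    """Shannon entropy (in bits) of the k-mer composition of a sequence.

    Lower values suggest patterns, repeats, or depletion of one or more
    nucleotide species; higher values suggest a more random sequence.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    counts = Counter(kmers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    # Adding 0.0 turns the -0.0 produced by a single-symbol sequence into 0.0.
    return entropy + 0.0


if __name__ == "__main__":
    print(shannon_entropy("AAAAAAAA"))  # homopolymer: minimal complexity
    print(shannon_entropy("ACGTACGT"))  # all four bases equally frequent
```

With k=1 the measure reflects depletion of nucleotide species; a larger k (for example, k=2 or k=3) would also be sensitive to short repeated patterns.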

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more genomic regions, wherein at least one of the one or more genomic regions comprises a genomic region feature related to the mappability of one or more nucleic acid molecules. The mappability of a nucleic acid molecule may refer to uniqueness of its alignment to a reference sequence. A nucleic acid molecule with low mappability may have poor alignment to a reference sequence.
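By way of non-limiting illustration, the notion of mappability as uniqueness of alignment may be approximated by counting how many times each k-mer of a reference sequence occurs in that reference. This is a deliberately simplified sketch: it considers only exact matches on one strand, and the function name and scoring (one divided by the number of occurrences) are assumptions rather than the method of any particular alignment tool.

```python
from collections import Counter


def kmer_uniqueness(reference, k):
    """Per-position uniqueness score for each k-mer start in a reference.

    A score of 1.0 means the k-mer starting at that position occurs once
    in the reference; lower scores mean it occurs at multiple positions.
    """
    reference = reference.upper()
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


if __name__ == "__main__":
    print(kmer_uniqueness("ACGTACGA", 4))  # every 4-mer is unique
    print(kmer_uniqueness("AAAAA", 2))     # the 2-mer AA occurs four times
```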

The methods and systems as disclosed herein may comprise, or comprisethe use of, nucleic acid samples or subsets of nucleic acid moleculescomprising one or more genomic regions comprising one or more genomicregion features. In some aspects of the disclosure, a single genomicregion comprises 1 or more, 2 or more, 3 or more, 4 or more, 5 or more,6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12or more, 13 or more, 14 or more, or 15 or more genomic region features.The two or more genomic regions may comprise 1 or more, 2 or more, 3 ormore, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more,10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more,20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more,60 or more, 70 or more, 80 or more, 90 or more, or 100 or more genomicregion features. In some aspects of the disclosure, two or more genomicregions comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more,6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12or more, 13 or more, 14 or more, or 15 or more genomic region features.The one or more genomic regions may comprise 1 or more, 2 or more, 3 ormore, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more,10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more,20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more,60 or more, 70 or more, 80 or more, 90 or more, or 100 or more identicalor similar genomic region features. Alternatively, or additionally, twoor more genomic regions comprise 1 or more, 2 or more, 3 or more, 4 ormore, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more,11 or more, 12 or more, 13 or more, 14 or more, or 15 or more genomicregion features. 
The one or more genomic regions may comprise 1 or more,2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 ormore, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 ormore, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 ormore, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100or more different genomic region features.

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising two or more genomic regions, wherein the two or more genomic regions are differentiable by one or more genomic region features. The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising two or more subsets of nucleic acid molecules, wherein the two or more subsets of nucleic acid molecules are differentiable by one or more genomic region features. The two or more genomic regions and/or the two or more subsets of nucleic acid molecules may be differentiable by 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, or 15 or more genomic region features. The one or more genomic regions may comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, or 30 or more genomic region features.

The methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising one or more sets of genomic regions. For example, the methods and systems as disclosed herein may comprise, or comprise the use of, nucleic acid samples or subsets of nucleic acid molecules comprising 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100 or more sets of genomic regions. The one or more sets of genomic regions may comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100 or more different genomic regions. The one or more sets of genomic regions may comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 20 or more, 25 or more, 30 or more, 35 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, or 100 or more identical or similar genomic regions. The one or more sets of genomic regions may comprise a combination of one or more different genomic regions and one or more identical or similar genomic regions.

Capture Probes

The methods and systems disclosed herein may comprise, or comprise the use of, one or more capture probes, a plurality of capture probes, or one or more capture probe sets. Typically, the capture probe comprises a nucleic acid binding site. The capture probe may further comprise one or more linkers. The capture probes may further comprise one or more labels. The one or more linkers may attach the one or more labels to the nucleic acid binding site.

The methods and systems disclosed herein may comprise, or comprise the use of, 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 20 or more, 30 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, 100 or more, 125 or more, 150 or more, 175 or more, 200 or more, 250 or more, 300 or more, 350 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, or 1000 or more capture probes or capture probe sets. The one or more capture probes or capture probe sets may be different, similar, identical, or a combination thereof.

The one or more capture probes may comprise a nucleic acid binding site that hybridizes to at least a portion of the one or more nucleic acid molecules or variant or derivative thereof in the sample or subset of nucleic acid molecules. The capture probes may comprise a nucleic acid binding site that hybridizes to one or more genomic regions. The capture probes may hybridize to different, similar, and/or identical genomic regions. The one or more capture probes may be at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 99% or more complementary to the one or more nucleic acid molecules or variants or derivatives thereof.
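By way of non-limiting illustration, the percent complementarity between a capture probe binding site and a target nucleic acid molecule may be computed as in the following sketch. The function names are hypothetical, and the sketch assumes an ungapped comparison of DNA sequences of equal length written 5′ to 3′; other ways of measuring complementarity are possible.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def percent_complementarity(probe, target):
    """Percent of target positions base-paired by the probe (ungapped).

    Both sequences are assumed to be DNA of equal, non-zero length.
    """
    if not probe or len(probe) != len(target):
        raise ValueError("probe and target must have equal, non-zero length")
    paired_strand = reverse_complement(probe)
    matches = sum(a == b for a, b in zip(paired_strand, target.upper()))
    return 100.0 * matches / len(target)


if __name__ == "__main__":
    target = "AACCGGTA"
    perfect_probe = reverse_complement(target)
    print(percent_complementarity(perfect_probe, target))
    print(percent_complementarity("TACCGGTA", target))
```

In this sketch, a probe that meets, for example, the "at least about 90%" level described above would be one for which `percent_complementarity` returns 90.0 or more.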

The capture probes may comprise one or more nucleotides. The capture probes may comprise 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 20 or more, 30 or more, 40 or more, 50 or more, 60 or more, 70 or more, 80 or more, 90 or more, 100 or more, 125 or more, 150 or more, 175 or more, 200 or more, 250 or more, 300 or more, 350 or more, 400 or more, 500 or more, 600 or more, 700 or more, 800 or more, 900 or more, or 1000 or more nucleotides. The capture probes may comprise about 100 nucleotides. The capture probes may comprise between about 10 to about 500 nucleotides, between about 20 to about 450 nucleotides, between about 30 to about 400 nucleotides, between about 40 to about 350 nucleotides, between about 50 to about 300 nucleotides, between about 60 to about 250 nucleotides, between about 70 to about 200 nucleotides, or between about 80 to about 150 nucleotides. In some aspects of the disclosure, the capture probes comprise between about 80 nucleotides to about 100 nucleotides.

The plurality of capture probes or the capture probe sets may comprise two or more capture probes with identical, similar, and/or different nucleic acid binding site sequences, linkers, and/or labels. For example, two or more capture probes comprise identical nucleic acid binding sites. In another example, two or more capture probes comprise similar nucleic acid binding sites. In yet another example, two or more capture probes comprise different nucleic acid binding sites. The two or more capture probes may further comprise one or more linkers. The two or more capture probes may further comprise different linkers. The two or more capture probes may further comprise similar linkers. The two or more capture probes may further comprise identical linkers. The two or more capture probes may further comprise one or more labels. The two or more capture probes may further comprise different labels. The two or more capture probes may further comprise similar labels. The two or more capture probes may further comprise identical labels.

Diseases or Conditions

The methods and systems as disclosed herein may comprise, or comprise the use of, predicting, diagnosing, and/or prognosing a status or outcome of a disease or condition in a subject based on one or more biomedical outputs. Predicting, diagnosing, and/or prognosing a status or outcome of a disease in a subject may comprise diagnosing a disease or condition, identifying a disease or condition, determining the stage of a disease or condition, assessing the risk of a disease or condition, assessing the risk of disease recurrence, assessing reproductive risk, assessing genetic risk to a fetus, assessing the efficacy of a drug, assessing risk of an adverse drug reaction, predicting optimal drug dosage, predicting drug resistance, or a combination thereof.

The samples disclosed herein may be from a subject suffering from a cancer. The sample may comprise malignant tissue, benign tissue, or a mixture thereof. The cancer may be a recurrent and/or refractory cancer. Examples of cancers include, but are not limited to, sarcomas, carcinomas, lymphomas, or leukemias.

Sarcomas are cancers of the bone, cartilage, fat, muscle, blood vessels,or other connective or supportive tissue. Sarcomas include, but are notlimited to, bone cancer, fibrosarcoma, chondrosarcoma, Ewing's sarcoma,malignant hemangioendothelioma, malignant schwannoma, bilateralvestibular schwannoma, osteosarcoma, soft tissue sarcomas (e.g.,alveolar soft part sarcoma, angiosarcoma, cystosarcoma phylloides,dermatofibrosarcoma, desmoid tumor, epithelioid sarcoma, extraskeletalosteosarcoma, fibrosarcoma, hemangiopericytoma, hemangiosarcoma,Kaposi's sarcoma, leiomyosarcoma, liposarcoma, lymphangiosarcoma,lymphosarcoma, malignant fibrous histiocytoma, neurofibrosarcoma,rhabdomyosarcoma, and synovial sarcoma).

Carcinomas are cancers that begin in the epithelial cells, which arecells that cover the surface of the body, produce hormones, and make upglands. By way of non-limiting example, carcinomas include breastcancer, pancreatic cancer, lung cancer, colon cancer, colorectal cancer,rectal cancer, kidney cancer, bladder cancer, stomach cancer, prostatecancer, liver cancer, ovarian cancer, brain cancer, vaginal cancer,vulvar cancer, uterine cancer, oral cancer, penile cancer, testicularcancer, esophageal cancer, skin cancer, cancer of the fallopian tubes,head and neck cancer, gastrointestinal stromal cancer, adenocarcinoma,cutaneous or intraocular melanoma, cancer of the anal region, cancer ofthe small intestine, cancer of the endocrine system, cancer of thethyroid gland, cancer of the parathyroid gland, cancer of the adrenalgland, cancer of the urethra, cancer of the renal pelvis, cancer of theureter, cancer of the endometrium, cancer of the cervix, cancer of thepituitary gland, neoplasms of the central nervous system (CNS), primaryCNS lymphoma, brain stem glioma, and spinal axis tumors. The cancer maybe a skin cancer, such as a basal cell carcinoma, squamous, melanoma,nonmelanoma, or actinic (solar) keratosis.

The cancer may be a lung cancer. Lung cancer can start in the airways that branch off the trachea to supply the lungs (bronchi) or the small air sacs of the lung (the alveoli). Lung cancers include non-small cell lung carcinoma (NSCLC), small cell lung carcinoma, and mesothelioma. Examples of NSCLC include squamous cell carcinoma, adenocarcinoma, and large cell carcinoma. The mesothelioma may be a cancerous tumor of the lining of the lung and chest cavity (pleura) or lining of the abdomen (peritoneum). The mesothelioma may be due to asbestos exposure. The cancer may be a brain cancer, such as a glioblastoma.

Alternatively, the cancer may be a central nervous system (CNS) tumor. CNS tumors may be classified as gliomas or nongliomas. The glioma may be malignant glioma, high grade glioma, or diffuse intrinsic pontine glioma. Examples of gliomas include astrocytomas, oligodendrogliomas (or mixtures of oligodendroglioma and astrocytoma elements), and ependymomas. Astrocytomas include, but are not limited to, low-grade astrocytomas, anaplastic astrocytomas, glioblastoma multiforme, pilocytic astrocytoma, pleomorphic xanthoastrocytoma, and subependymal giant cell astrocytoma. Oligodendrogliomas include low-grade oligodendrogliomas (or oligoastrocytomas) and anaplastic oligodendrogliomas. Nongliomas include meningiomas, pituitary adenomas, primary CNS lymphomas, and medulloblastomas. The cancer may be a meningioma.

The leukemia may be an acute lymphocytic leukemia, acute myelocytic leukemia, chronic lymphocytic leukemia, or chronic myelocytic leukemia. Additional types of leukemias include hairy cell leukemia, chronic myelomonocytic leukemia, and juvenile myelomonocytic leukemia.

Lymphomas are cancers of the lymphocytes and may develop from either B or T lymphocytes. The two major types of lymphoma are Hodgkin's lymphoma, previously known as Hodgkin's disease, and non-Hodgkin's lymphoma. Hodgkin's lymphoma is marked by the presence of the Reed-Sternberg cell. Non-Hodgkin's lymphomas are all lymphomas which are not Hodgkin's lymphoma. Non-Hodgkin's lymphomas may be indolent lymphomas or aggressive lymphomas. Non-Hodgkin's lymphomas include, but are not limited to, diffuse large B cell lymphoma, follicular lymphoma, mucosa-associated lymphatic tissue lymphoma (MALT), small cell lymphocytic lymphoma, mantle cell lymphoma, Burkitt's lymphoma, mediastinal large B cell lymphoma, Waldenström macroglobulinemia, nodal marginal zone B cell lymphoma (NMZL), splenic marginal zone lymphoma (SMZL), extranodal marginal zone B cell lymphoma, intravascular large B cell lymphoma, primary effusion lymphoma, and lymphomatoid granulomatosis.

Additional diseases and/or conditions include, but are not limited to, atherosclerosis, inflammatory diseases, autoimmune diseases, and rheumatic heart disease. Examples of inflammatory diseases include, but are not limited to, acne vulgaris, Alzheimer's, ankylosing spondylitis, arthritis (osteoarthritis, rheumatoid arthritis (RA), psoriatic arthritis), asthma, atherosclerosis, celiac disease, chronic prostatitis, Crohn's disease, colitis, dermatitis, diverticulitis, fibromyalgia, glomerulonephritis, hepatitis, irritable bowel syndrome (IBS), systemic lupus erythematosus (SLE), nephritis, Parkinson's disease, pelvic inflammatory disease, sarcoidosis, ulcerative colitis, and vasculitis.

Examples of autoimmune diseases include, but are not limited to, acute disseminated encephalomyelitis (ADEM), Addison's disease, agammaglobulinemia, alopecia areata, amyotrophic lateral sclerosis, ankylosing spondylitis, antiphospholipid syndrome, antisynthetase syndrome, atopic allergy, atopic dermatitis, autoimmune aplastic anemia, autoimmune cardiomyopathy, autoimmune enteropathy, autoimmune hemolytic anemia, autoimmune hepatitis, autoimmune inner ear disease, autoimmune lymphoproliferative syndrome, autoimmune peripheral neuropathy, autoimmune pancreatitis, autoimmune polyendocrine syndrome, autoimmune progesterone dermatitis, autoimmune thrombocytopenic purpura, autoimmune urticaria, autoimmune uveitis, Balo disease/Balo concentric sclerosis, Behcet's disease, Berger's disease, Bickerstaff's encephalitis, Blau syndrome, bullous pemphigoid, Castleman's disease, celiac disease, Chagas disease, chronic inflammatory demyelinating polyneuropathy, chronic recurrent multifocal osteomyelitis, chronic obstructive pulmonary disease, Churg-Strauss syndrome, cicatricial pemphigoid, Cogan syndrome, cold agglutinin disease, complement component 2 deficiency, contact dermatitis, cranial arteritis, CREST syndrome, Crohn's disease, Cushing's syndrome, cutaneous leukocytoclastic angiitis, Degos disease, Dercum's disease, dermatitis herpetiformis, dermatomyositis, diabetes mellitus type 1, diffuse cutaneous systemic sclerosis, Dressler's syndrome, drug-induced lupus, discoid lupus erythematosus, eczema, endometriosis, enthesitis-related arthritis, eosinophilic fasciitis, eosinophilic gastroenteritis, epidermolysis bullosa acquisita, erythema nodosum, erythroblastosis fetalis, essential mixed cryoglobulinemia, Evans syndrome, fibrodysplasia ossificans progressiva, fibrosing alveolitis (or idiopathic pulmonary fibrosis), gastritis, gastrointestinal pemphigoid, giant cell arteritis, glomerulonephritis, Goodpasture's syndrome, Graves' disease, Guillain-Barré syndrome (GBS), Hashimoto's encephalopathy,
Hashimoto's thyroiditis, Henoch-Schonlein purpura, herpes gestationis (also known as gestational pemphigoid), hidradenitis suppurativa, Hughes-Stovin syndrome, hypogammaglobulinemia, idiopathic inflammatory demyelinating diseases, idiopathic pulmonary fibrosis, IgA nephropathy, inclusion body myositis, chronic inflammatory demyelinating polyneuropathy, interstitial cystitis, juvenile idiopathic arthritis (also known as juvenile rheumatoid arthritis), Kawasaki's disease, Lambert-Eaton myasthenic syndrome, leukocytoclastic vasculitis, lichen planus, lichen sclerosus, linear IgA disease (LAD), Lou Gehrig's disease (also amyotrophic lateral sclerosis), lupoid hepatitis (also known as autoimmune hepatitis), lupus erythematosus, Majeed syndrome, Ménière's disease, microscopic polyangiitis, mixed connective tissue disease, morphea, Mucha-Habermann disease, multiple sclerosis, myasthenia gravis, myositis, neuromyelitis optica (also Devic's disease), neuromyotonia, ocular cicatricial pemphigoid, opsoclonus myoclonus syndrome, Ord's thyroiditis, palindromic rheumatism, PANDAS (pediatric autoimmune neuropsychiatric disorders associated with streptococcus), paraneoplastic cerebellar degeneration, paroxysmal nocturnal hemoglobinuria (PNH), Parry Romberg syndrome, Parsonage-Turner syndrome, pars planitis, pemphigus vulgaris, pernicious anaemia, perivenous encephalomyelitis, POEMS syndrome, polyarteritis nodosa, polymyalgia rheumatica, polymyositis, primary biliary cirrhosis, primary sclerosing cholangitis, progressive inflammatory neuropathy, psoriasis, psoriatic arthritis, pyoderma gangrenosum, pure red cell aplasia, Rasmussen's encephalitis, Raynaud phenomenon, relapsing polychondritis, Reiter's syndrome, restless leg syndrome, retroperitoneal fibrosis, rheumatoid arthritis, rheumatic fever, sarcoidosis, Schmidt syndrome (another form of APS), Schnitzler syndrome, scleritis, scleroderma, serum sickness, Sjögren's syndrome, spondyloarthropathy, stiff person syndrome, subacute bacterial endocarditis (SBE), Susac's syndrome, Sweet's syndrome,
sympathetic ophthalmia, Takayasu's arteritis, temporalarteritis (also known as “giant cell arteritis”), thrombocytopenia,Tolosa-Hunt syndrome, transverse myelitis, ulcerative colitis,undifferentiated connective tissue disease different from mixedconnective tissue disease, undifferentiated spondyloarthropathy,urticarial vasculitis, vasculitis, vitiligo, and Wegener'sgranulomatosis.

The methods and systems as provided herein may also be useful for detecting, monitoring, diagnosing and/or predicting a subject's response to an implanted device. Exemplary medical devices include, but are not limited to, stents, replacement heart valves, implanted cerebellar stimulators, hip replacement joints, breast implants, and knee implants.

The methods and systems as disclosed herein may be used for monitoring the health of a fetus using whole or partial genome analysis of nucleic acids derived from a fetus, as compared to the maternal genome. For example, nucleic acids can be useful in pregnant subjects for fetal diagnostics, with fetal nucleic acids serving as a marker for gender, rhesus D status, fetal aneuploidy, and sex-linked disorders. The methods and systems as disclosed herein may identify fetal mutations or genetic abnormalities. The methods and systems as disclosed herein can enable detection of extra or missing chromosomes, particularly those typically associated with birth defects or miscarriage. The methods and systems as disclosed herein may comprise, or comprise the use of, the diagnosis, prediction or monitoring of autosomal trisomies (e.g., Trisomy 13, 15, 16, 18, 21, or 22) and may be based on the detection of foreign molecules. The trisomy may be associated with an increased chance of miscarriage (e.g., Trisomy 15, 16, or 22). Alternatively, the trisomy that is detected is a liveborn trisomy that may indicate that an infant may be born with birth defects (e.g., Trisomy 13 (Patau Syndrome), Trisomy 18 (Edwards Syndrome), and Trisomy 21 (Down Syndrome)). The abnormality may also be of a sex chromosome (e.g., XXY (Klinefelter's Syndrome), XYY (Jacobs Syndrome), or XXX (Trisomy X)). The methods disclosed herein may comprise one or more genomic regions on the following chromosomes: 13, 18, 21, X, or Y. For example, the foreign molecule may be on chromosome 21 and/or on chromosome 18, and/or on chromosome 13. The one or more genomic regions may comprise multiple sites on multiple chromosomes.

Further fetal conditions that can be determined based on the methods and systems herein include monosomy of one or more chromosomes (X chromosome monosomy, also known as Turner's syndrome), trisomy of one or more chromosomes (13, 18, 21, and X), tetrasomy and pentasomy of one or more chromosomes (which in humans is most commonly observed in the sex chromosomes, e.g., XXXX, XXYY, XXXY, XYYY, XXXXX, XXXXY, XXXYY, XYYYY and XXYYY), monoploidy, triploidy (three of every chromosome, e.g., 69 chromosomes in humans), tetraploidy (four of every chromosome, e.g., 92 chromosomes in humans), pentaploidy and multiploidy.

The methods and systems as disclosed may comprise detecting, monitoring, quantitating, or evaluating one or more pathogen-derived nucleic acid molecules or one or more diseases or conditions caused by one or more pathogens. Exemplary pathogens include, but are not limited to, Bordetella, Borrelia, Brucella, Campylobacter, Chlamydia, Chlamydophila, Clostridium, Corynebacterium, Enterococcus, Escherichia, Francisella, Haemophilus, Helicobacter, Legionella, Leptospira, Listeria, Mycobacterium, Mycoplasma, Neisseria, Pseudomonas, Rickettsia, Salmonella, Shigella, Staphylococcus, Streptococcus, Treponema, Vibrio, or Yersinia. Additional pathogens include, but are not limited to, Mycobacterium tuberculosis, Streptococcus, Pseudomonas, Shigella, Campylobacter, and Salmonella.

The disease or conditions caused by one or more pathogens may comprise tuberculosis, pneumonia, foodborne illnesses, tetanus, typhoid fever, diphtheria, syphilis, leprosy, bacterial vaginosis, bacterial meningitis, bacterial pneumonia, a urinary tract infection, bacterial gastroenteritis, and bacterial skin infection. Examples of bacterial skin infections include, but are not limited to, impetigo, which may be caused by Staphylococcus aureus or Streptococcus pyogenes; erysipelas, which may be caused by a streptococcus bacterial infection of the deep epidermis with lymphatic spread; and cellulitis, which may be caused by normal skin flora or by exogenous bacteria.

The pathogen may be a fungus, such as Candida, Aspergillus, Cryptococcus, Histoplasma, Pneumocystis, and Stachybotrys. Examples of diseases or conditions caused by a fungus include, but are not limited to, jock itch, yeast infection, ringworm, and athlete's foot.

The pathogen may be a virus. Examples of viruses include, but are not limited to, adenovirus, coxsackievirus, Epstein-Barr virus, Hepatitis virus (e.g., Hepatitis A, B, and C), herpes simplex virus (type 1 and 2), cytomegalovirus, herpes virus, HIV, influenza virus, measles virus, mumps virus, papillomavirus, parainfluenza virus, poliovirus, respiratory syncytial virus, rubella virus, and varicella-zoster virus. Examples of diseases or conditions caused by viruses include, but are not limited to, cold, flu, hepatitis, AIDS, chicken pox, rubella, mumps, measles, warts, and poliomyelitis.

The pathogen may be a protozoan, such as Acanthamoeba (e.g., A. astronyxis, A. castellanii, A. culbertsoni, A. hatchetti, A. polyphaga, A. rhysodes, A. healyi, A. divionensis), Brachiola (e.g., B. connori, B. vesicularum), Cryptosporidium (e.g., C. parvum), Cyclospora (e.g., C. cayetanensis), Encephalitozoon (e.g., E. cuniculi, E. hellem, E. intestinalis), Entamoeba (e.g., E. histolytica), Enterocytozoon (e.g., E. bieneusi), Giardia (e.g., G. lamblia), Isospora (e.g., I. belli), Microsporidium (e.g., M. africanum, M. ceylonensis), Naegleria (e.g., N. fowleri), Nosema (e.g., N. algerae, N. ocularum), Pleistophora, Trachipleistophora (e.g., T. anthropophthera, T. hominis), and Vittaforma (e.g., V. corneae).

Therapeutics

The methods and systems as disclosed herein may comprise, or comprise the use of, treating and/or preventing a disease or condition in a subject based on one or more biomedical outputs. The one or more biomedical outputs may recommend one or more therapies. The one or more biomedical outputs may suggest, select, designate, recommend or otherwise determine a course of treatment and/or prevention of a disease or condition. The one or more biomedical outputs may recommend modifying or continuing one or more therapies. Modifying one or more therapies may comprise administering, initiating, reducing, increasing, and/or terminating one or more therapies. The one or more therapies may comprise an anti-cancer, antiviral, antibacterial, antifungal, or immunosuppressive therapy, or a combination thereof. The one or more therapies may treat, alleviate, or prevent one or more diseases or indications.

Examples of anti-cancer therapies include, but are not limited to, surgery, chemotherapy, radiation therapy, immunotherapy/biological therapy, and photodynamic therapy. Anti-cancer therapies may comprise chemotherapeutics, monoclonal antibodies (e.g., rituximab, trastuzumab), cancer vaccines (e.g., therapeutic vaccines, prophylactic vaccines), gene therapy, or a combination thereof.

The one or more therapies may comprise an antimicrobial. Generally, an antimicrobial refers to a substance that kills or inhibits the growth of microorganisms such as bacteria, fungi, viruses, or protozoans. Antimicrobial drugs either kill microbes (microbicidal) or prevent the growth of microbes (microbiostatic). There are mainly two classes of antimicrobial drugs: those obtained from natural sources (e.g., antibiotics, protein synthesis inhibitors (such as aminoglycosides, macrolides, tetracyclines, chloramphenicol, polypeptides)) and synthetic agents (e.g., sulphonamides, cotrimoxazole, quinolones). In some instances, the antimicrobial drug is an antibiotic, anti-viral, anti-fungal, anti-malarial, anti-tuberculosis drug, anti-leprotic, or anti-protozoal.

Antibiotics are generally used to treat bacterial infections. Antibiotics may be divided into two categories: bactericidal antibiotics and bacteriostatic antibiotics. Generally, bactericidals may kill bacteria directly, whereas bacteriostatics may prevent them from dividing. Antibiotics may be derived from living organisms or may include synthetic antimicrobials, such as the sulfonamides. Antibiotics may include aminoglycosides, such as amikacin, gentamicin, kanamycin, neomycin, netilmicin, tobramycin, and paromomycin. Alternatively, antibiotics may be ansamycins (e.g., geldanamycin, herbimycin), carbacephems (e.g., loracarbef), carbapenems (e.g., ertapenem, doripenem, imipenem, cilastatin, meropenem), glycopeptides (e.g., teicoplanin, vancomycin, telavancin), lincosamides (e.g., clindamycin, lincomycin, daptomycin), macrolides (e.g., azithromycin, clarithromycin, dirithromycin, erythromycin, roxithromycin, troleandomycin, telithromycin, spectinomycin, spiramycin), nitrofurans (e.g., furazolidone, nitrofurantoin), and polypeptides (e.g., bacitracin, colistin, polymyxin B).

In some instances, the antibiotic therapy includes cephalosporins such as cefadroxil, cefazolin, cefalotin, cefalexin, cefaclor, cefamandole, cefoxitin, cefprozil, cefuroxime, cefixime, cefdinir, cefditoren, cefoperazone, cefotaxime, cefpodoxime, ceftazidime, ceftibuten, ceftizoxime, ceftriaxone, cefepime, ceftaroline fosamil, and ceftobiprole.

The antibiotic therapy may also include penicillins. Examples of penicillins include amoxicillin, ampicillin, azlocillin, carbenicillin, cloxacillin, dicloxacillin, flucloxacillin, mezlocillin, methicillin, nafcillin, oxacillin, penicillin g, penicillin v, piperacillin, temocillin, and ticarcillin.

Alternatively, quinolones may be used to treat a bacterial infection. Examples of quinolones include ciprofloxacin, enoxacin, gatifloxacin, levofloxacin, lomefloxacin, moxifloxacin, nalidixic acid, norfloxacin, ofloxacin, trovafloxacin, grepafloxacin, sparfloxacin, and temafloxacin.

In some instances, the antibiotic therapy comprises a combination of two or more therapies. For example, amoxicillin and clavulanate, ampicillin and sulbactam, piperacillin and tazobactam, or ticarcillin and clavulanate may be used to treat a bacterial infection.

Sulfonamides may also be used to treat bacterial infections. Examples of sulfonamides include, but are not limited to, mafenide, sulfonamidochrysoidine, sulfacetamide, sulfadiazine, silver sulfadiazine, sulfamethizole, sulfamethoxazole, sulfanilamide, sulfasalazine, sulfisoxazole, trimethoprim, and trimethoprim-sulfamethoxazole (co-trimoxazole) (tmp-smx).

Tetracyclines are another example of antibiotics. Tetracyclines may inhibit the binding of aminoacyl-tRNA to the mRNA-ribosome complex by binding to the 30S ribosomal subunit in the mRNA translation complex. Tetracyclines include demeclocycline, doxycycline, minocycline, oxytetracycline, and tetracycline. Additional antibiotics that may be used to treat bacterial infections include arsphenamine, chloramphenicol, fosfomycin, fusidic acid, linezolid, metronidazole, mupirocin, platensimycin, quinupristin/dalfopristin, rifaximin, thiamphenicol, tigecycline, tinidazole, clofazimine, dapsone, capreomycin, cycloserine, ethambutol, ethionamide, isoniazid, pyrazinamide, rifampicin, rifamycin, rifabutin, rifapentine, and streptomycin.

Antiviral therapies are a class of medication used specifically for treating viral infections. Like antibiotics, specific antivirals are used for specific viruses. They are relatively harmless to the host, and therefore can be used to treat infections. Antiviral therapies may inhibit various stages of the viral life cycle. For example, an antiviral therapy may inhibit attachment of the virus to a cellular receptor. Such antiviral therapies may include agents that mimic the virus associated protein (VAP) and bind to the cellular receptors. Other antiviral therapies may inhibit viral entry, viral uncoating (e.g., amantadine, rimantadine, pleconaril), viral synthesis, viral integration, viral transcription, or viral translation (e.g., fomivirsen). In some instances, the antiviral therapy is a morpholino antisense. Antiviral therapies should be distinguished from viricides, which actively deactivate virus particles outside the body.

Many of the antiviral drugs available are designed to treat infections by retroviruses, mostly HIV. Antiretroviral drugs may include the class of protease inhibitors, reverse transcriptase inhibitors, and integrase inhibitors. Drugs to treat HIV may include a protease inhibitor (e.g., invirase, saquinavir, kaletra, lopinavir, lexiva, fosamprenavir, norvir, ritonavir, prezista, darunavir, reyataz, viracept), integrase inhibitor (e.g., raltegravir), transcriptase inhibitor (e.g., abacavir, ziagen, agenerase, amprenavir, aptivus, tipranavir, crixivan, indinavir, fortovase, saquinavir, Intelence™, etravirine, isentress, viread), reverse transcriptase inhibitor (e.g., delavirdine, efavirenz, epivir, hivid, nevirapine, retrovir, AZT, stavudine, truvada, videx), fusion inhibitor (e.g., fuzeon, enfuvirtide), or chemokine coreceptor antagonist (e.g., selzentry, emtriva, emtricitabine, epzicom, or trizivir). Alternatively, antiretroviral therapies may be combination therapies, such as atripla (e.g., efavirenz, emtricitabine, and tenofovir disoproxil fumarate) and complera (emtricitabine, rilpivirine, and tenofovir disoproxil fumarate). Herpes viruses, best known for causing cold sores and genital herpes, are usually treated with the nucleoside analogue acyclovir. Viral hepatitis (A-E) is caused by five unrelated hepatotropic viruses and is also commonly treated with antiviral drugs, depending on the type of infection. Influenza A and B viruses are important targets for the development of new influenza treatments to overcome the resistance to existing neuraminidase inhibitors such as oseltamivir.

In some instances, the antiviral therapy may comprise a reverse transcriptase inhibitor. Reverse transcriptase inhibitors may be nucleoside reverse transcriptase inhibitors or non-nucleoside reverse transcriptase inhibitors. Nucleoside reverse transcriptase inhibitors may include, but are not limited to, combivir, emtriva, epivir, epzicom, hivid, retrovir, trizivir, truvada, videx ec, videx, viread, zerit, and ziagen. Non-nucleoside reverse transcriptase inhibitors may comprise edurant, intelence, rescriptor, sustiva, and viramune (immediate release or extended release).

Protease inhibitors are another example of antiviral drugs and may include, but are not limited to, agenerase, aptivus, crixivan, fortovase, invirase, kaletra, lexiva, norvir, prezista, reyataz, and viracept. Alternatively, the antiviral therapy may comprise a fusion inhibitor (e.g., enfuvirtide) or an entry inhibitor (e.g., maraviroc).

Additional examples of antiviral drugs include abacavir, acyclovir, adefovir, amantadine, amprenavir, ampligen, arbidol, atazanavir, atripla, boceprevir, cidofovir, combivir, darunavir, delavirdine, didanosine, docosanol, edoxudine, efavirenz, emtricitabine, enfuvirtide, entecavir, famciclovir, fomivirsen, fosamprenavir, foscarnet, fosfonet, fusion inhibitors, ganciclovir, ibacitabine, imunovir, idoxuridine, imiquimod, indinavir, inosine, integrase inhibitor, interferons (e.g., interferon type I, II, III), lamivudine, lopinavir, loviride, maraviroc, moroxydine, methisazone, nelfinavir, nevirapine, nexavir, nucleoside analogues, oseltamivir, peg-interferon alfa-2a, penciclovir, peramivir, pleconaril, podophyllotoxin, protease inhibitors, raltegravir, reverse transcriptase inhibitors, ribavirin, rimantadine, ritonavir, pyramidine, saquinavir, stavudine, tea tree oil, tenofovir, tenofovir disoproxil, tipranavir, trifluridine, trizivir, tromantadine, truvada, valaciclovir, valganciclovir, vicriviroc, vidarabine, viramidine, zalcitabine, zanamivir, and zidovudine.

An antifungal drug is medication that may be used to treat fungal infections such as athlete's foot, ringworm, candidiasis (thrush), serious systemic infections such as cryptococcal meningitis, and others. Antifungals work by exploiting differences between mammalian and fungal cells to kill off the fungal organism. Unlike bacteria, both fungi and humans are eukaryotes. Thus, fungal and human cells are similar at the molecular level, making it more difficult to find a target for an antifungal drug to attack that does not also exist in the infected organism.

Antiparasitics are a class of medications which are indicated for the treatment of infection by parasites, such as nematodes, cestodes, trematodes, infectious protozoa, and amoebae. Like antifungals, they must kill the infecting pest without serious damage to the host.

EXAMPLES

Methods and systems of the present disclosure may be applied to various types of samples, such as nucleic acid samples, protein samples, or other biological samples.

Example 1 Three Independent Workflows for the Production of ESP, HGCP and LRP Libraries

This example provides three independent workflows for the preparation of Exome Supplement Plus (ESP), high GC content (HGCP), and specific enrichment pulldown (LRP) libraries from a single nucleic acid sample.

Illumina's RSB (or 50 mM Sodium Ascorbic) is added to three different Covaris microtubes containing 1 μg of genomic DNA (gDNA) from a single sample to produce 52.5 μL of total volume in each microtube. The microtubes are designated as ESP, HGCP, and LRP. The gDNA in each microtube is sheared using the Covaris settings in Table 1.

TABLE 1 Covaris settings
                          ESP    HGCP   LRP
Duty factor:              20%    20%    20%
Cyc/burst:                200    200    200
Time (sec):               80     80     25
Peak Incident Power (W):  50     50     50
Temp (° C.):              20     20     20

The microtubes are spun down and 50 μL of the fragmented DNA is transferred to PCR plates. 10 μL of RSB is added to each well. The HGCP sample plate is heated at 65° C. for 5 minutes. The ESP and LRP plates are not heated at 65° C. 40 μL of Illumina's ERP is added to each sample plate by pipetting up and down to mix. The plates are sealed. The plates are incubated at 30° C. for 30 minutes. The DNA is purified by adding Ampure XP beads to each plate. For the ESP and HGCP plates, 90 μL of Ampure XP beads is added. For the LRP plate, 50 μL of Ampure XP beads is added. DNA is eluted with 17.5 μL of RSB.
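The bead volumes above can be restated as bead-to-sample volume ratios, the quantity varied in Example 5. The following is a minimal sketch, assuming the well volume is simply the sum of the three additions stated in this example (50 μL of fragmented DNA, 10 μL of RSB, 40 μL of ERP) with no loss to evaporation or dead volume. The function and variable names are illustrative and do not come from the disclosure.

```python
# Volumes in microliters, taken from the end-repair step of Example 1.
FRAGMENTED_DNA_UL = 50.0
RSB_UL = 10.0
ERP_UL = 40.0

# Ampure XP bead volume added to each plate, per Example 1.
BEAD_VOLUME_UL = {"ESP": 90.0, "HGCP": 90.0, "LRP": 50.0}


def bead_to_sample_ratio(bead_ul: float, sample_ul: float) -> float:
    """Return the ratio of bead volume to sample volume."""
    if sample_ul <= 0:
        raise ValueError("sample volume must be positive")
    return bead_ul / sample_ul


# Assumption: volumes are additive, giving 100 uL per well before cleanup.
reaction_ul = FRAGMENTED_DNA_UL + RSB_UL + ERP_UL

ratios = {
    name: bead_to_sample_ratio(bead_ul, reaction_ul)
    for name, bead_ul in BEAD_VOLUME_UL.items()
}

for name, ratio in ratios.items():
    print(f"{name}: bead-to-sample ratio {ratio:.2f}")
```

Under that assumption the ESP and HGCP cleanups use a ratio of 0.9 and the LRP cleanup a ratio of 0.5, which is in line with the trend reported in Example 5 that lower ratios retain larger fragments.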

12.5 μL of Illumina's ATL is added to the eluted DNA, and the mixture is transferred to a new plate. The plates with the eluted DNA are incubated at 37° C. for 30 minutes.

Adapters are ligated to the DNA by adding 2.5 μL of RSB, 2.5 μL of Ligation (LIG) mix, and 2.5 μL of adapters to each well. The samples are mixed well and the plate is sealed. The plate is incubated for 10 minutes at 30° C. 5 μL of STL (0.5M EDTA) is added to each well. The samples are mixed thoroughly. The adapter ligated DNA is purified by adding 42.5 μL of Ampure XP beads to each well. Ligated DNA is eluted with 50 μL of RSB. The ligated DNA is purified by adding 50 μL of Ampure beads and eluting the DNA with 20 μL of RSB. The Ampure bead purification and elution are performed twice.

The ligated DNA is amplified by adding 25 μL of 2× KAPA HiFi polymerase and 5 μL of primer to each ligated DNA sample and by running a PCR with 8 cycles. The amplified DNA is purified with 50 μL of Ampure beads and the DNA is eluted with 30 μL of RSB. The amplified DNA from the three different sample preparations is used to prepare the ESP, HGCP, and LRP libraries.

The ESP, HGCP, and LRP libraries are validated by running each library on a DNA 1000 chip and quantifying with a BR Qubit assay.

Hybridization reactions are performed on the ESP, HGCP, and LRP samples using ESP, HGCP and LRP specific capture probes. Three independent hybridization reactions are set up according to Table 2.

TABLE 2
pull down     ESP    HGCP   LRP
DNA library   ESP    HGCP   LRP
probe         ESP    HGCP   LRP

Hybridization reactions are performed according to Agilent's standard SureSelect protocol.

Example 2 Two Independent Workflows for the Production of ESP, HGCP and LRP Libraries

This example provides two independent workflows for the preparation of Exome Supplement Plus (ESP), high GC content (HGCP) and specific enrichment pulldown (LRP) libraries from a single nucleic acid sample.

RSB (or 50 mM Sodium Acetate) is added to two different Covaris microtubes containing 1 μg of genomic DNA (gDNA) from a single sample to produce 52.5 μL of total volume in each microtube. The microtubes are designated as ESP/HGCP and LRP. The gDNA in each microtube is sheared using the Covaris settings in Table 3.

TABLE 3 Covaris settings
                          ESP/HGCP   LRP
Duty factor:              20%        20%
Cyc/burst:                200        200
Time (sec):               80         25
Peak Incident Power (W):  50         50
Temp (° C.):              20         20

The microtubes are spun down and 50 μL of the fragmented DNA is transferred to PCR plates. 10 μL of RSB is added to each well. The ESP/HGCP sample plate is heated at 65° C. for 5 minutes. Alternatively, neither the ESP/HGCP plate nor the LRP plate is heated at 65° C. 40 μL of ERP is added to each sample plate by pipetting up and down to mix. The plates are sealed. The plates are incubated at 30° C. for 30 minutes. The DNA is purified by adding Ampure XP beads to each plate. For the ESP/HGCP plate, 90 μL of Ampure XP beads is added. For the LRP plate, 50 μL of Ampure XP beads is added. DNA is eluted with 17.5 μL of RSB.

12.5 μL of ATL is added to the eluted DNA. The plates with the eluted DNA are incubated at 37° C. for 30 minutes.

Adapters are ligated to the DNA by adding 2.5 μL of RSB, 2.5 μL of Ligation (LIG) mix, and 2.5 μL of adapters to each well. The samples are mixed well and the plate is sealed. The plate is incubated for 10 minutes at 30° C. 5 μL of STL (0.5M EDTA) is added to each well. The samples are mixed thoroughly. The adapter ligated DNA is purified by adding 42.5 μL of Ampure XP beads to each well. Ligated DNA is eluted with 50 μL of RSB. The ligated DNA is purified by adding 50 μL of Ampure beads and eluting the DNA with 20 μL of RSB. The Ampure bead purification and elution are performed twice.

The ligated DNA is amplified by adding 25 μL of 2× KAPA HiFi polymerase and 5 μL of primer to each ligated DNA sample and by running a PCR with 8 cycles. The amplified DNA is purified with 50 μL of Ampure beads and the DNA is eluted with 30 μL of RSB. The amplified DNA from the sample preparations is used to prepare the ESP, HGCP, and LRP libraries.

The ESP, HGCP, and LRP libraries are validated by running each library on a DNA High-Sensitivity chip and quantifying with a BR Qubit assay.

Hybridization reactions are performed on the ESP, HGCP, and LRP samples using ESP, HGCP and LRP specific capture probes. Three independent hybridization reactions are set up according to Table 4.

TABLE 4
pull down     ESP        HGCP       LRP
DNA library   ESP/HGCP   ESP/HGCP   LRP
probe         ESP        HGCP       LRP

Hybridization reactions are performed according to Agilent's standard SureSelect protocol.

Example 3 A Single Workflow for the Production of ESP, HGCP and LRP Libraries

This example provides a single workflow for the preparation of Exome Supplement Plus (ESP), high GC content (HGCP), and specific enrichment pulldown (LRP) libraries from a single nucleic acid sample.

RSB (or 50 mM Sodium Acetate) is added to a Covaris microtube containing 3 μg of genomic DNA (gDNA) from a single sample to produce 52.5 μL of total volume. The gDNA in the microtube is sheared using the Covaris settings in Table 5.

TABLE 5 Covaris settings
Duty factor:              20%
Cyc/burst:                200
Time (sec):               25
Peak Incident Power (W):  50
Temp (° C.):              20

The microtube is spun down and 50 μL of the fragmented DNA is transferred to a single PCR plate. 10 μL of RSB is added to each well. The sample plate is either heated at 65° C. for 5 minutes or not heated. 40 μL of ERP is added to the sample plate by pipetting up and down to mix. The plate is sealed. The plate is incubated at 30° C. for 30 minutes. The DNA is purified by adding 90 μL of Ampure XP beads to the plate. The mixture is incubated for 8 minutes at room temperature. The standard Ampure protocol is performed. Beads are rehydrated in 20 μL of thawed RSB for 2 minutes at room temperature. 17.5 μL of supernatant is transferred to new wells in an Illumina ALP plate.

12.5 μL of ATL is added to the eluted DNA. The ALP plate is incubated at 37° C. for 30 minutes.

Adapters are ligated to the DNA by adding 2.5 μL of RSB, 2.5 μL of Ligation (LIG) mix, and 2.5 μL of adapters to each well. The samples are mixed well and the plate is sealed. The plate is incubated for 10 minutes at 30° C. 5 μL of STL (0.5M EDTA) is added to each well. The samples are mixed thoroughly. The adapter ligated DNA is purified by adding 42.5 μL of Ampure XP beads to each well. Ligated DNA is eluted with 100 μL of RSB. 50 μL of Ampure XP beads is added to the 100 μL of ligated DNA. The 150 μL of supernatant is transferred to a new well, leaving the Ampure XP bead bound DNA in the previous wells. DNA is eluted from the Ampure XP beads by adding 20 μL of RSB. The eluted DNA is the LRP subset.

20 μL of Ampure beads is added to the 150 μL of supernatant. The DNA is eluted in 100 μL of RSB. 60 μL of Ampure XP beads is added to the 100 μL of DNA. The 160 μL of supernatant is transferred to a new well, leaving the Ampure XP bead bound DNA in the previous wells. DNA is eluted from the Ampure XP beads by adding 20 μL of RSB. The eluted DNA is the ESP/HGCP subset.

The LRP subset of DNA and the ESP/HGCP subset of DNA are amplified by adding 25 μL of 2× KAPA HiFi polymerase and 5 μL of primer to each ligated DNA sample and by running a PCR with 8 cycles. The amplified DNA is purified with 50 μL of Ampure XP beads and the beads are rehydrated in 30 μL of RSB. The amplified DNA from the subsets is used to prepare the ESP, HGCP, and LRP libraries.

The ESP, HGCP, and LRP libraries are validated by running each library on a DNA High-Sensitivity chip and quantifying with a BR Qubit assay.

Hybridization reactions are performed on the ESP, HGCP, and LRP samples using ESP, HGCP and LRP specific capture probes. Three independent hybridization reactions are set up according to Table 6.

TABLE 6
pull down     ESP        HGCP       LRP
DNA library   ESP/HGCP   ESP/HGCP   LRP
probe         ESP        HGCP       LRP

Hybridization reactions are performed according to Agilent's standard SureSelect protocol.

Example 4 Shear Time and Fragment Sizes

Genomic DNA (gDNA) is sheared by varying the shear time of a Covaris setting. The gDNA fragments produced by various shear times are then analyzed. Results are shown in FIG. 5 and Table 7.

TABLE 7 Shear time and mean fragment size
Number   Shear Time (seconds)   Mean Fragment Size (basepairs)
1        375                    150
2        175                    200
3        80                     200
4        40                     400
5        32                     500
6        25                     800
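As an illustration of how Table 7 might be used when planning a shear step, the sketch below looks up the tabulated shear time or times whose reported mean fragment size is closest to a desired size. It is a plain nearest-entry lookup over the six reported points, not a fitted model, and the function name is hypothetical. It also assumes the other Covaris settings match those used to generate Table 7, which the example does not state.

```python
# (shear time in seconds, mean fragment size in base pairs), from Table 7.
TABLE_7 = [
    (375, 150),
    (175, 200),
    (80, 200),
    (40, 400),
    (32, 500),
    (25, 800),
]


def closest_shear_times(target_bp: int) -> list[int]:
    """Return every tabulated shear time whose mean fragment size is
    nearest to target_bp. Ties are all returned, in table order."""
    best = min(abs(size - target_bp) for _, size in TABLE_7)
    return [t for t, size in TABLE_7 if abs(size - target_bp) == best]


print(closest_shear_times(800))  # shortest shear time, largest fragments
print(closest_shear_times(200))  # two tabulated times report 200 bp
```

Because two rows (175 and 80 seconds) report the same 200 bp mean, the lookup returns both rather than choosing between them.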

Example 5 Bead Ratio and Fragment Size

The ratio of the volume of beads to the volume of the nucleic acid sample is varied and the effects of these ratios on mean fragment size are analyzed. As shown in FIG. 6A, varying the ratio of the volume of the beads to the volume of the nucleic acid sample across 0.8 (line 1), 0.7 (line 2), 0.6 (line 3), 0.5 (line 4), and 0.4 (line 5) shifts the mean size of the DNA fragments. Generally, the lower the ratio, the larger the mean fragment size.

Example 6 Ligation Reactions and Fragment Size

Two different shear times are combined with three different ligation reactions on a nucleic acid sample. Sample 1 is sheared for 25 seconds and a ligation reaction is performed on the long insert DNA as prepared by Step 5 of Example 9 (lig-up). Sample 2 is sheared for 32 seconds and a ligation reaction is performed on the long insert DNA as prepared by Step 5 of Example 9 (lig-up). Sample 3 is sheared for 25 seconds and a ligation reaction is performed on the mid insert DNA as prepared by Step 8 of Example 9 (lig-mid). Sample 4 is sheared for 32 seconds and a ligation reaction is performed on the mid insert DNA as prepared by Step 8 of Example 9 (lig-mid). Sample 5 is sheared for 25 seconds and a ligation reaction is performed on the short insert DNA as prepared by Step 11 of Example 9 (lig-low). Sample 6 is sheared for 32 seconds and a ligation reaction is performed on the short insert DNA as prepared by Step 11 of Example 9 (lig-low). FIG. 7 shows the mean fragment size for the six reactions.
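The six samples form a 2 × 3 design (two shear times crossed with three insert fractions). A minimal sketch of that design as a data structure is given below; the dictionary keys are illustrative names chosen here, while the shear times, labels, and Example 9 step numbers are taken from the paragraph above.

```python
from itertools import product

SHEAR_TIMES_SEC = (25, 32)

# (ligation label, insert fraction, Example 9 step that yields the fraction)
LIGATION_INPUTS = (
    ("lig-up", "long insert", 5),
    ("lig-mid", "mid insert", 8),
    ("lig-low", "short insert", 11),
)

# Iterating ligation inputs in the outer position reproduces the sample
# numbering used in Example 6 (samples 1-2 lig-up, 3-4 lig-mid, 5-6 lig-low).
samples = [
    {
        "sample": number,
        "shear_time_sec": shear_time,
        "ligation": label,
        "insert": insert,
        "example_9_step": step,
    }
    for number, ((label, insert, step), shear_time) in enumerate(
        product(LIGATION_INPUTS, SHEAR_TIMES_SEC), start=1
    )
]

for s in samples:
    print(s)
```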

Example 7 Rhodobacter Sphaeroides

The Rhodobacter sphaeroides ATCC 17025 genome is 4.56 million base pairs long and the GC content of the genome is analyzed. Results of the analysis are shown in Table 8.

TABLE 8
Browser Chrom/     Length     GC Content   Gene    NCBI RefSeq
Plasmid Name       (bp)       (%)          Count   Accession
chr                3217726    68.48        3181    NC_009428
plasmid_pRSPA01    877879     67.69        849     NC_009429
plasmid_pRSPA02    289489     67.6         278     NC_009430
plasmid_pRSPA03    121962     69.36        114     NC_009431
plasmid_pRSPA04    36198      64.05        32      NC_009432
plasmid_pRSPA05    13873      58.93        12      NC_009433
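The per-replicon figures in Table 8 can be combined into a genome-wide total length and a length-weighted GC content. The sketch below is illustrative only. It reads the pRSPA02 entry as 67.6% GC with 278 genes, which is an interpretation of the fused value "67.6278" in the flattened table, and the variable names are assumptions rather than anything defined in the disclosure.

```python
# (replicon name, length in bp, GC content in percent), from Table 8.
# Assumption: the pRSPA02 row is read as GC 67.6% (gene count 278).
REPLICONS = [
    ("chr", 3217726, 68.48),
    ("plasmid_pRSPA01", 877879, 67.69),
    ("plasmid_pRSPA02", 289489, 67.6),
    ("plasmid_pRSPA03", 121962, 69.36),
    ("plasmid_pRSPA04", 36198, 64.05),
    ("plasmid_pRSPA05", 13873, 58.93),
]

# Total genome length is the sum over the chromosome and all plasmids.
total_bp = sum(length for _, length, _ in REPLICONS)

# Weight each replicon's GC percentage by its length.
weighted_gc = sum(length * gc for _, length, gc in REPLICONS) / total_bp

print(f"total length: {total_bp} bp ({total_bp / 1e6:.2f} Mbp)")
print(f"length-weighted GC content: {weighted_gc:.2f}%")
```

The summed length agrees with the 4.56 million base pairs stated in the text, and the weighted GC content comes out at roughly 68%, consistent with treating this genome as a high GC test case in Example 8.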

Example 8 Optimization of Rhodobacter Sphaeroides DNA (High GC content)

DNA from Rhodobacter sphaeroides is amplified with a variety of polymerases and amplification conditions. Amplified DNA is then sequenced. High GC flowcell refers to sequencing reactions on DNA samples comprising primarily DNA with high GC content. Mix GC flowcell refers to sequencing reactions on DNA samples comprising a mixture of DNA with high and low GC content. As shown in Table 9, brief heating at 65° C. before ER (end repair) improves coverage of high GC content DNA (see PST-000292).

TABLE 9
                                                                                   ratio
DNA          PCR conditions                     PF reads   mapped reads            (>80% GC, <60% GC)

2 × 150, High GC flowcell (primarily High GC content)
PST-000190   PCR free                           985736     945734     95.94%      73.90%
PST-000191   kapa 10 cycle                      1211930    1174309    96.90%      70.80%
PST-000192   kapa 10 cycle + betaine            1247310    1206917    96.76%      70.80%
PST-000193   kapa 10 cycle + DMSO               1183084    1144464    96.74%      70.40%
PST-000194   kapa hifi 10 cycle                 1102832    1067306    96.78%      68.30%
PST-000195   kapa hifi 10 cycle + deaza_dGTP    756856     739857     97.75%      2.40%
PST-000196   kapa GC 10 cycle                   1299004    1255979    96.69%      70.60%
PST-000197   illumina 10 cycle                  1347780    1298231    96.32%      52.10%
PST-000198   illumina 10 cycle, long denature   1256278    1209607    96.28%      50.60%
PST-000199   kapa 8 cycle                       1013116    978349     96.57%      69.80%

2 × 150, Mix GC flowcell (high and low GC content)
PST-000191   kapa 10 cycle                      909256     854341     93.96%      66.80%

2 × 250, Mix GC flowcell (high and low GC content)
PST-000290   PCR free                           1009022    779191     77.22%      70.80%
PST-000291   PCR free, 60C ER                   1100298    863058     78.44%      73.10%
PST-000292   PCR free, 65C ER                   1157944    932378     80.52%      78.20%
PST-000293   PCR free, 70C ER                   1200318    944391     78.68%      75.10%
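The first percentage reported for each library in Table 9 appears to be the fraction of pass-filter (PF) reads that mapped. The table does not define the column explicitly, so the sketch below treats that as an assumption and checks it against a few rows; the function name is illustrative.

```python
# (library, PCR conditions, PF reads, mapped reads), selected rows of Table 9.
ROWS = [
    ("PST-000190", "PCR free", 985736, 945734),
    ("PST-000290", "PCR free", 1009022, 779191),
    ("PST-000292", "PCR free, 65C ER", 1157944, 932378),
]


def mapped_percent(pf_reads: int, mapped_reads: int) -> float:
    """Percentage of pass-filter reads that mapped (assumed definition)."""
    if pf_reads <= 0:
        raise ValueError("PF read count must be positive")
    return 100.0 * mapped_reads / pf_reads


for library, conditions, pf_reads, mapped_reads in ROWS:
    pct = mapped_percent(pf_reads, mapped_reads)
    print(f"{library} ({conditions}): {pct:.2f}% of PF reads mapped")
```

For these rows the computed values round to the tabulated 95.94%, 77.22%, and 80.52%, which supports reading the column as mapped reads divided by PF reads.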

Example 9 Preparation of Genomic DNA

The following steps are used to prepare subsets of nucleic acid molecules from a sample comprising genomic DNA:

1. A sample comprising genomic DNA is sheared with M220 for 15-35 seconds.

2. The fragmented gDNA is purified with SPRI beads after ligation (the ratio of the volume of SPRI beads to the DNA sample is 1) and the DNA is eluted into 100 μL of elution buffer (EB).

3. 50 μL of SPRI beads are added to the 100 μL of DNA.

4. The supernatant is transferred to a new tube.

5. The DNA from the remaining bead bound DNA is eluted. This eluted DNA is called the long insert.

6. 10 μL of SPRI beads are added to the supernatant from Step 4.

7. The supernatant from Step 6 is transferred to a new tube.

8. The DNA from the remaining bead bound DNA of Step 6 is eluted. This eluted DNA is called the mid insert.

9. 20 μL of SPRI beads are added to the supernatant from Step 7.

10. The supernatant from Step 9 is transferred to a new tube.

11. The DNA from the remaining bead bound DNA of Step 9 is eluted. This eluted DNA is called the short insert.
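The three bead additions in Steps 3, 6, and 9 can be summarized as cumulative bead-to-sample ratios. The sketch below assumes each ratio is expressed against the original 100 μL of eluted DNA and that all bead buffer from earlier additions is carried over with the supernatant; neither convention is stated in the steps above, and the names used are illustrative.

```python
INITIAL_DNA_UL = 100.0  # elution volume from Step 2

# (SPRI bead volume added in uL, fraction eluted from those beads)
ADDITIONS = [
    (50.0, "long insert"),   # Step 3, eluted in Step 5
    (10.0, "mid insert"),    # Step 6, eluted in Step 8
    (20.0, "short insert"),  # Step 9, eluted in Step 11
]


def cumulative_ratios(initial_ul, additions):
    """Return (fraction, cumulative bead-to-sample ratio) per addition.

    Assumption: bead buffer from earlier additions stays in the transferred
    supernatant, so the bead volumes accumulate across steps.
    """
    total_beads_ul = 0.0
    result = []
    for bead_ul, fraction in additions:
        total_beads_ul += bead_ul
        result.append((fraction, total_beads_ul / initial_ul))
    return result


fractions = cumulative_ratios(INITIAL_DNA_UL, ADDITIONS)
for fraction, ratio in fractions:
    print(f"{fraction}: cumulative bead-to-sample ratio {ratio:.1f}")
```

Under these assumptions the long, mid, and short inserts are captured at cumulative ratios of 0.5, 0.6, and 0.8, which falls within the 0.4 to 0.8 range examined in Example 5.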

Example 10 Segregation and Independent Processing of InterpretableGenomic Content

Illumina TruSeq Exome enrichment followed by Illumina sequencing is atypical example of targeted DNA sequencing. However, this process canfail to target many biomedically interesting non-exomic as well asexomic regions for enrichment and also can fail to adequately sequencemany of the regions it does target. Furthermore, many of the sequencedregions may have unacceptably high error rates. It is found that many ofthese gaps and failures are due to specific problems related to bulksequencing that may be more adequately addressed by specializedsequencing protocols or technologies.

A large and unique set of medically interpretable content encompassing both proprietary data and numerous publicly available sources, including both exomic and non-exomic regions, as well as non-reference or alternative sequences, has been compiled. Much of this content is not adequately covered in standard exome sequencing. This performance gap has been analyzed, and a multipronged approach has been developed to more completely cover this content by independently processing particular types of problems with specialized sample preparation, amplification, sequencing technology and/or bioinformatics to best recover the underlying sequence. Three targeted subsets and protocols have been developed to address this performance gap.

In content regions skipped by standard exome processing, but still in nominally tractable genomic regions, additional baits have been developed to enrich these regions for standard sequencing. In some cases, non-reference sequences of interest may additionally be targeted (e.g., common normal and/or cancer SV junctions, common InDels or, in general, sequences in which the reference has a rare allele that adversely affects enrichment hybridization performance for most of the population). This Exome Supplement Pulldown (ESP) is pooled with standard exome DNA libraries for very economical sequencing. Table 10 lists proprietary and public data sets of medical and research interest, as well as the anticipated coverage gap with Illumina's TruSeq exome kit. Table 10 shows an exemplary list of nucleic acid molecules in the ESP subset.

TABLE 10 List of content in the ESP subset
Content120924LG.bed

Priority | Content Name | By Set Bases | By Set Ranges | Cumulative Bases | Cumulative Ranges | Missed in TruSeq by Set Bases | Missed in TruSeq by Set Ranges | Missed in TruSeq Cumulative Bases | Missed in TruSeq Cumulative Ranges
1 | MendelDB_snp_0913_2012.bed | 78,882 | 1,576 | 65,968 | 1,163 | 13,844 | 299 | 13,844 | 299
2 | PharmGKB_snp_0914_2012.bed | 14,600 | 292 | 80,253 | 1,444 | 8,642 | 178 | 22,486 | 477
3 | medical_dbSNP_regulome1_Suspect.bed | 22,005 | 440 | 100,522 | 1,836 | 15,796 | 323 | 38,088 | 794
4 | GeneReview_snp_0913_2012.bed | 61,615 | 1,231 | 114,970 | 2,098 | 11,459 | 249 | 41,520 | 865
5 | HGMD_ClinVar_snp_0913_2012.bed | 1,062,528 | 21,220 | 779,524 | 12,537 | 197,594 | 3,873 | 219,612 | 4,313
6 | Clinical_Channel_snp_0913_2012.bed | 771,293 | 15,404 | 928,410 | 15,103 | 219,084 | 4,239 | 324,542 | 6,309
7 | OMIM_snp_0914_2012.bed | 500,554 | 9,999 | 928,410 | 15,103 | 102,540 | 2,092 | 324,542 | 6,309
8 | Varimed_multi_ethnic_snp_0914_2012.bed | 89,201 | 1,784 | 1,002,730 | 16,568 | 77,177 | 1,568 | 391,484 | 7,660
9 | Varimed_highconf_snp_0914_2012.bed | 1,177,673 | 23,553 | 2,032,039 | 36,733 | 1,073,682 | 21,358 | 1,355,991 | 26,787
10 | HGMD_mut.bed | 4,305,255 | 84,885 | 3,421,232 | 51,658 | 556,284 | 8,900 | 1,698,980 | 31,491
11 | Exons_VIP-Genes-120713 | 233,123 | 652 | 3,604,514 | 51,812 | 49,783 | 331 | 1,738,877 | 31,691
12 | Regulome1_VIP-Genes-120713 | 3,000 | 60 | 3,606,632 | 51,850 | 2,503 | 54 | 1,740,885 | 31,732
13 | Exons_MendelDB-Genes-120916 | 14,325,633 | 40,761 | 16,091,103 | 71,703 | 3,726,418 | 19,767 | 5,049,919 | 45,516
14 | Regulome1_MendelDB-Genes-120916 | 147,553 | 2,951 | 16,201,606 | 73,825 | 125,129 | 2,553 | 5,157,542 | 47,677
15 | Exons_HGMD-Genes-120913 | 26,801,793 | 74,195 | 29,292,835 | 105,150 | 7,047,052 | 36,440 | 8,633,799 | 64,224
16 | Regulome1_HGMD-Genes-120913 | 254,356 | 5,087 | 29,381,797 | 106,849 | 214,286 | 4,372 | 8,720,429 | 65,963
17 | Exons_CancerGeneCensus_gene | 3,786,660 | 9,924 | 31,022,066 | 110,788 | 934,846 | 4,641 | 9,113,184 | 67,842
18 | Regulome1_CancerGeneCensus_gene | 31,651 | 633 | 31,031,990 | 110,976 | 27,842 | 569 | 9,122,893 | 68,033
19 | Exons_OMIM_Mendelian_gene | 19,484,727 | 54,285 | 32,499,571 | 114,516 | 5,032,796 | 26,346 | 9,518,418 | 69,854
20 | Regulome1_OMIM_Mendelian_gene | 210,906 | 4,218 | 32,522,006 | 114,944 | 179,685 | 3,668 | 9,540,226 | 70,290
21 | Exons_HGMD_Mendelian_gene | 27,988,755 | 77,427 | 35,226,837 | 121,987 | 7,363,878 | 37,504 | 10,174,050 | 73,377
22 | Regulome1_HGMD_Mendelian_gene | 266,255 | 5,325 | 35,260,944 | 122,647 | 226,927 | 4,632 | 10,207,319 | 74,046
23 | Exons_HLAclass1 | 5,969 | 24 | 35,260,944 | 122,647 | 3,554 | 26 | 10,207,319 | 74,046
24 | Regulome1_HLAclass1 | 950 | 19 | 35,260,944 | 122,647 | 743 | 15 | 10,207,319 | 74,046
25 | Exons_HLAclass2 | 28,398 | 82 | 35,273,022 | 122,664 | 11,750 | 50 | 10,209,818 | 74,060
26 | Regulome1_HLAclass2 | 350 | 7 | 35,273,098 | 122,665 | 100 | 2 | 10,209,868 | 74,061
27 | CFTR_Intronic | 603 | 3 | 35,273,651 | 122,667 | 603 | 3 | 10,210,421 | 74,063
28 | Triallelic_in_Footprint | 7,891 | 154 | 35,280,947 | 122,809 | 7,434 | 150 | 10,217,712 | 74,205
29 | phastConsElements46way-top0.5percent | 2,662,784 | 7,681 | 36,864,760 | 126,352 | 1,794,961 | 5,994 | 11,753,993 | 78,509

In content regions having very high GC content (>70%), standard sequencing typically performs poorly because the elevated T_m (melting temperature) of these areas can cause poor PCR or other amplification due to competition with more numerous lower T_m sequences and sequences with other problematic structures, e.g., hairpins and other secondary structure. These regions are typically either skipped or perform poorly in standard sequencing. A process to target content areas of high GC content (HGCP) and customized sample preparation and sequencing protocols to specifically improve the performance of this library have been developed by optimizing temperatures, incubation times, buffers, and enzymes. An example composition of such a library intersected with the content of the HGCP subset is shown in Table 11.

TABLE 11 Exemplary list of content in the HGCP subset
HGCPmerge100_120924LG.bed (includes 50 bp dilation)

Priority Content Name | By Set Bases | By Set Ranges | Cumulative Bases | Cumulative Ranges | HGCP by Set Bases | HGCP by Set Ranges | HGCP Cumulative Bases | HGCP Cumulative Ranges
MendelDB_snp_0913_2012.bed | 180,101 | 909 | 180,101 | 909 | 12,539 | 89 | 12,539 | 89
PharmGKB_snp_0914_2012.bed | 42,673 | 266 | 222,474 | 1,173 | 3,092 | 22 | 15,631 | 111
medical_dbSNP_regulome1_Suspect.bed | 62,226 | 390 | 281,629 | 1,547 | 883 | 7 | 16,387 | 116
GeneReview_snp_0913_2012.bed | 143,233 | 753 | 322,106 | 1,765 | 11,074 | 77 | 20,586 | 145
HGMD_ClinVar_snp_0913_2012.bed | 1,881,403 | 8,749 | 1,976,714 | 9,287 | 162,863 | 948 | 166,434 | 971
Clinical_Channel_snp_0913_2012.bed | 1,553,376 | 7,442 | 2,382,381 | 11,058 | 112,701 | 674 | 185,369 | 1,075
OMIM_snp_0914_2012.bed | 1,091,353 | 5,527 | 2,382,381 | 11,058 | 92,276 | 563 | 185,369 | 1,075
Varimed_multi_ethnic_snp_0914_2012.bed | 269,869 | 1,600 | 2,609,304 | 12,341 | 4,066 | 29 | 188,395 | 1,097
Varimed_highconf_snp_0914_2012.bed | 3,538,004 | 19,531 | 5,749,403 | 29,271 | 26,929 | 203 | 208,093 | 1,244
HGMD_mut.bed | 4,713,066 | 18,283 | 8,448,253 | 37,978 | 448,145 | 2,157 | 481,839 | 2,386
Exons_VIP-Genes-120713 | 301,582 | 564 | 8,655,649 | 38,208 | 31,335 | 100 | 506,873 | 2,445
Regulome1_VIP-Genes-120713 | 9,180 | 48 | 8,661,738 | 38,231 | 0 | 0 | 506,873 | 2,445
Exons_MendelDB-Genes-120916 | 18,594,872 | 32,943 | 23,550,190 | 57,802 | 1,986,487 | 6,270 | 2,147,437 | 7,091
Regulome1_MendelDB-Genes-120916 | 445,360 | 2,606 | 23,890,795 | 59,262 | 9,488 | 77 | 2,151,072 | 7,110
Exons_HGMD-Genes-120913 | 34,574,073 | 60,810 | 40,302,799 | 85,129 | 3,697,923 | 11,625 | 3,918,341 | 12,444
Regulome1_HGMD-Genes-120913 | 765,607 | 4,447 | 40,572,922 | 86,302 | 14,220 | 115 | 3,921,622 | 12,455
Exons_CancerGeneCensus_gene | 4,823,472 | 8,240 | 42,627,241 | 89,578 | 526,402 | 1,590 | 4,139,209 | 13,145
Regulome1_CancerGeneCensus_gene | 95,341 | 581 | 42,657,367 | 89,729 | 2,374 | 17 | 4,140,165 | 13,150
Exons_OMIM_Mendelian_gene | 25,166,172 | 44,269 | 44,497,746 | 92,630 | 2,702,676 | 8,437 | 4,340,856 | 13,745
Regulome1_OMIM_Mendelian_gene | 636,246 | 3,726 | 44,567,099 | 92,943 | 14,171 | 111 | 4,342,155 | 13,751
Exons_HGMD_Mendelian_gene | 36,090,180 | 63,709 | 48,011,279 | 98,805 | 3,871,482 | 12,113 | 4,684,379 | 14,795
Regulome1_HGMD_Mendelian_gene | 802,747 | 4,700 | 48,116,355 | 99,290 | 17,080 | 133 | 4,685,899 | 14,802
Exons_HLAclass1 | 8,432 | 9 | 48,116,355 | 99,290 | 3,128 | 3 | 4,685,899 | 14,802
Regulome1_HLAclass1 | 2,995 | 10 | 48,116,355 | 99,290 | 0 | 0 | 4,685,899 | 14,802
Exons_HLAclass2 | 38,082 | 65 | 48,130,838 | 99,292 | 1,017 | 6 | 4,686,134 | 14,804
Regulome1_HLAclass2 | 913 | 5 | 48,130,855 | 99,292 | 0 | 0 | 4,686,134 | 14,804
CFTR_Intronic | 903 | 3 | 48,131,608 | 99,294 | 0 | 0 | 4,686,134 | 14,804
Triallelic_in_Footprint | 23,509 | 128 | 48,154,196 | 99,400 | 632 | 4 | 4,686,366 | 14,806
phastConsElements46way-top0.5percent | 3,417,667 | 6,728 | 50,091,253 | 102,331 | 162,245 | 531 | 4,721,759 | 14,903
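The >70% GC criterion used above to define the HGCP subset can be illustrated with a sliding-window GC calculation. The following Python sketch is illustrative only: the 70% threshold is taken from the text, whereas the window size, step size and function names are hypothetical choices and do not describe how the HGCP regions were actually derived.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def high_gc_windows(seq, window=100, step=50, threshold=0.70):
    """Return (start, end, gc) for every window whose GC fraction exceeds threshold.

    Coordinates are 0-based, half-open. window and step are hypothetical defaults.
    """
    hits = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        frac = gc_fraction(chunk)
        if frac > threshold:
            hits.append((start, start + len(chunk), frac))
    return hits


# Toy sequence: an AT-rich flank, a 100 bp GC-only island, and another AT-rich flank.
example = "AT" * 50 + "GC" * 50 + "AT" * 50
print(high_gc_windows(example))
```

Only the window that lies entirely within the GC island exceeds the threshold; windows that straddle the flanks are 50% GC and are not reported.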

Repetitive elements in the genome and other genomic regions outside of the exome can be difficult to sequence, align and/or assemble, particularly with short read technology (e.g., 2×100 on Illumina HiSeq). Many of these regions in the exome are skipped or perform poorly with standard enrichment strategies. Genomic regions outside the exome (such as introns of HLA) are typically not targeted by exome sequencing. The difficulties in sequencing may be due to poor enrichment efficiency, degenerate mapping of reads and inadequate read length to span common simple tandem repeats or biomedically relevant expanding repeats. These problems are addressed by developing a specific enrichment pulldown (LRP) and protocol to extract primarily these regions for more expensive long paired read sequencing (e.g., 2×250 bp on Illumina MiSeq) or long single read sequencing (e.g., 5 kb single molecule sequencing on PacBio RS or future technologies as available). This longer read sequencing technology is currently 10-fold to several 100-fold more expensive per base than bulk sequencing and is often not currently commercially viable for the entire content regions. Furthermore, in some cases (e.g., PacBio RS), the raw error profile is problematic for general use in SNV calling. However, for some types of important problems, these technologies are required to obtain accurate or clinical quality results that correctly map degenerate sequence or span a repeat sequence. A protocol has been developed in which all such regions are separated into a subset and are sequenced in parallel to achieve a useful economy of scale for the preparation, yet still limit the total amount of sequencing to a practical amount. In addition to sequencing these regions with a different technology, the alignment and other bioinformatic pipeline elements are customized to best leverage these longer reads to improve coverage, accuracy, and characterization (e.g., allelotyping STRs and unstable expanding repeat regions).

Phasing and/or haplotyping of HLA and blood typing genes is more tractable using longer reads and longer molecules provided in this library. Reassembly of ambiguous regions is more tractable using the longer molecules and reads from these libraries. An example composition of such a library is listed in Table 12. In addition, the intersection of this library with particular classes of structural problems or genomic content is shown in the final block.

TABLE 12
LRPmerge300_120924LG.bed (includes 100 bp dilation)

Priority Content Name | By Set Bases | By Set Ranges | Cumulative Bases | Cumulative Ranges | LRP by Set Bases | LRP by Set Ranges | LRP Cumulative Bases | LRP Cumulative Ranges
MendelDB_snp_0913_2012.bed | 300,380 | 823 | 300,380 | 823 | 83,107 | 238 | 83,107 | 238
PharmGKB_snp_0914_2012.bed | 71,926 | 258 | 372,876 | 1,076 | 17,087 | 65 | 100,722 | 302
medical_dbSNP_regulome1_Suspect.bed | 107,508 | 371 | 474,890 | 1,433 | 77,284 | 267 | 173,712 | 560
GeneReview_snp_0913_2012.bed | 239,363 | 690 | 542,080 | 1,635 | 72,372 | 210 | 194,354 | 624
HGMD_ClinVar_snp_0913_2012.bed | 3,074,892 | 7,774 | 3,241,005 | 8,261 | 945,906 | 2,543 | 1,022,992 | 2,782
Clinical_Channel_snp_0913_2012.bed | 2,537,774 | 6,684 | 3,880,310 | 9,829 | 723,269 | 2,045 | 1,162,383 | 3,154
OMIM_snp_0914_2012.bed | 1,811,109 | 5,019 | 3,880,310 | 9,829 | 554,960 | 1,605 | 1,162,383 | 3,154
Varimed_multi_ethnic_snp_0914_2012.bed | 484,477 | 1,439 | 4,295,467 | 10,934 | 153,156 | 419 | 1,299,826 | 3,485
Varimed_highconf_snp_0914_2012.bed | 6,365,502 | 16,815 | 9,980,716 | 25,215 | 1,574,979 | 3,476 | 2,671,355 | 6,283
HGMD_mut.bed | 7,400,636 | 15,605 | 14,095,445 | 32,199 | 2,199,289 | 4,999 | 3,861,157 | 8,525
Exons_VIP-Genes-120713 | 387,447 | 472 | 14,341,711 | 32,377 | 147,978 | 194 | 3,958,330 | 8,621
Regulome1_VIP-Genes-120713 | 15,830 | 41 | 14,353,492 | 32,387 | 5,972 | 14 | 3,962,756 | 8,624
Exons_MendelDB-Genes-120916 | 23,887,624 | 26,913 | 32,552,166 | 47,917 | 7,780,357 | 11,130 | 10,049,097 | 16,057
Regulome1_MendelDB-Genes-120916 | 794,991 | 2,347 | 33,211,287 | 48,852 | 150,309 | 470 | 10,133,001 | 16,215
Exons_HGMD-Genes-120913 | 44,321,062 | 49,866 | 53,983,227 | 69,498 | 14,883,355 | 21,110 | 17,258,590 | 25,544
Regulome1_HGMD-Genes-120913 | 1,355,456 | 4,013 | 54,494,848 | 70,283 | 266,188 | 794 | 17,324,858 | 25,670
Exons_CancerGeneCensus_gene | 6,121,488 | 6,826 | 57,070,263 | 72,967 | 2,045,184 | 2,809 | 18,218,353 | 26,867
Regulome1_CancerGeneCensus_gene | 173,742 | 520 | 57,133,995 | 73,058 | 27,135 | 94 | 18,229,400 | 26,886
Exons_OMIM_Mendelian_gene | 32,224,942 | 36,331 | 59,444,387 | 75,405 | 10,505,055 | 15,008 | 18,976,117 | 27,934
Regulome1_OMIM_Mendelian_gene | 1,129,215 | 3,368 | 59,571,025 | 75,634 | 191,633 | 620 | 18,989,192 | 27,963
Exons_HGMD_Mendelian_gene | 46,253,280 | 52,403 | 63,975,577 | 80,407 | 15,161,059 | 21,685 | 20,193,619 | 29,734
Regulome1_HGMD_Mendelian_gene | 1,416,254 | 4,273 | 64,173,958 | 80,756 | 248,193 | 810 | 20,221,296 | 29,790
Exons_HLAclass1 | 10,894 | 3 | 64,173,958 | 80,756 | 10,894 | 3 | 20,221,296 | 29,790
Regulome1_HLAclass1 | 5,912 | 6 | 64,173,958 | 80,756 | 5,912 | 6 | 20,221,296 | 29,790
Exons_HLAclass2 | 50,951 | 42 | 64,188,111 | 80,758 | 50,951 | 42 | 20,235,449 | 29,792
Regulome1_HLAclass2 | 2,196 | 3 | 64,188,111 | 80,758 | 2,196 | 3 | 20,235,449 | 29,792
CFTR_Intronic | 1,203 | 3 | 64,189,624 | 80,759 | 0 | 0 | 20,235,449 | 29,792
Triallelic_in_Footprint | 39,060 | 117 | 64,232,417 | 80,836 | 36,957 | 109 | 20,276,075 | 29,864
phastConsElements46way-top0.5percent | 4,269,414 | 6,189 | 66,642,340 | 83,243 | 1,011,328 | 1,857 | 20,709,245 | 30,497
HLA-ClassI | 22,744 | 3 | 66,651,185 | 83,238 | 22,744 | 3 | 20,718,090 | 30,492
HLA-ClassII | 140,811 | 10 | 66,728,362 | 83,207 | 140,811 | 10 | 20,795,267 | 30,461
BloodTypingf10 k | 206,568 | 3 | 66,902,785 | 83,169 | 206,568 | 3 | 20,972,246 | 30,428
AmylaseRegion | 300,200 | 1 | 67,200,752 | 83,169 | 300,200 | 1 | 21,271,226 | 30,427
ImportantCompressions | 192,112 | 3 | 67,379,340 | 83,156 | 192,112 | 3 | 21,449,546 | 30,416
SMN1_SMN2 | 57,657 | 2 | 67,428,632 | 83,146 | 57,657 | 2 | 21,498,838 | 30,406

Priority Problem Name | By Set Bases | By Set Ranges | Cumulative Bases | Cumulative Ranges | LRP by Set Bases | LRP by Set Ranges | LRP Cumulative Bases | LRP Cumulative Ranges
v3NoCoverage | 51,511,046 | 26,632 | 51,506,246 | 26,608 | 756,946 | 947 | 756,946 | 947
ShortPEReadMappabilty | 131,941,961 | 31,071 | 135,610,684 | 40,898 | 1,415,767 | 1,581 | 1,445,847 | 1,660
SingleReadMappabilty | 239,094,172 | 149,523 | 241,428,271 | 156,194 | 2,364,928 | 3,258 | 2,382,194 | 3,311
ValidatedCompressions | 3,262,543 | 36 | 244,073,867 | 155,934 | 54,267 | 71 | 2,427,518 | 3,363
SegmentalDuplications | 162,351,720 | 6,902 | 287,784,823 | 145,772 | 4,084,205 | 4,129 | 4,393,421 | 4,945
STR > 50 bp | 128,885,115 | 201,050 | 395,004,522 | 297,226 | 1,209,546 | 2,883 | 5,359,052 | 7,494
GRCh37patches | 61,247,019 | 134 | 433,409,533 | 292,018 | 3,302,921 | 3,265 | 7,821,274 | 9,930
v3LowCoverage | 746,244,957 | 683,267 | 966,704,005 | 689,651 | 15,671,569 | 25,812 | 20,709,245 | 30,497
HLA-ClassI | 22,744 | 3 | 966,704,005 | 689,651 | 22,744 | 3 | 20,718,090 | 30,492
HLA-ClassII | 140,811 | 10 | 966,704,005 | 689,651 | 140,811 | 10 | 20,795,267 | 30,461
BloodTypingf10 k | 206,568 | 3 | 966,723,788 | 689,642 | 206,568 | 3 | 20,972,246 | 30,428
AmylaseRegion | 300,200 | 1 | 966,815,712 | 689,618 | 300,200 | 1 | 21,271,226 | 30,427
ImportantCompressions | 192,112 | 3 | 966,817,041 | 689,617 | 192,112 | 3 | 21,449,546 | 30,416
SMN1_SMN2 | 57,657 | 2 | 966,817,041 | 689,617 | 57,657 | 2 | 21,498,838 | 30,406
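The read-length limitation discussed above for the LRP subset can be made concrete with a simple check of whether a single read can span a repeat together with unique anchoring sequence on both sides. The following Python sketch is an illustration under stated assumptions: the 20 bp flank requirement and the example repeat lengths are hypothetical, paired reads are treated by their single-read length, and the read lengths are the examples named in the text (2×100, 2×250, and 5 kb single molecule).

```python
def read_spans_repeat(read_len, repeat_len, flank=20):
    """True if one read can cover the repeat plus `flank` unique bases on each side.

    flank=20 is a hypothetical anchoring requirement, not a value from the text.
    """
    return read_len >= repeat_len + 2 * flank


# Read lengths from the examples in the text; repeat lengths are hypothetical.
READ_LENGTHS = {
    "2x100 short read": 100,
    "2x250 long paired read": 250,
    "5 kb single molecule read": 5000,
}
REPEAT_LENGTHS = (60, 150, 400)

for name, read_len in READ_LENGTHS.items():
    spanned = [r for r in REPEAT_LENGTHS if read_spans_repeat(read_len, r)]
    print(f"{name}: spans repeats of {spanned} bp")
```

Under these assumptions a 100 bp read spans only the shortest repeat, which is consistent with routing such regions to a longer read technology.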

All three of these libraries have preliminary data combining standard TruSeq Exome and ESP to produce libraries which are called Exome+, Extended Exome, and ACE (Accuracy and Content Enhanced) Exome (Tables 13-14). These libraries significantly improve coverage of the RefSeq exons, our customized Exome, as well as dramatically improve the coverage of customized Variants (as many of these are outside the exome).

TABLE 13

Whole Genome Whole Exome Full Flowcell TruSeq Exome Extended Exome (E+, ESP) (PL 2.0) (PL 3.0) B (Alpha [?] in [?]) A Product Category [?] Description of Product Type Set Size 92× 54× 56× comparison metrics [?] Price Point Size Unit A B A B A B A B

Bronze NEMAR corrections [?] Genome
NEMAR.[?] 1,102 kbp [?] 64,324 23,408 3,170 31,894 [?] allele “variants” [?]
NEMAR.RefSeq [?] 17,998 kbp 12,488 1,058 10,436 3,048 31,224 1,048 removed “variants”
NEMAR.RefSeq [?] kbp 4,713 [?] 4,400 439 4,523 [?]
NEMAR.RefSeqUTR 11,307 kbp [?] 676 5,978 826 4,851 653
NEMAR.[?] 7,394 kbp [?] 446 4,935 468
NEMAR.[?] 4,080 kbp 2,171 [?] 376 72 1,851 371
NEMAR.[?] 11,080 kbp [?] 4,547 [?] 6,573 [?]

Reference [?]
Genome 2,861 Mbp 98.7% 51 3.3% 50 5.1% 49 Coverage % of [?]
RefSeq [?] 70,467 kbp 98.7% 52 77.9% 48 88.1% 48 target reported [?]
RefSeq [?] 33,366 kbp 99.0% 53 85.4% 48 94.7% 48 [?]
RefSeqUTR 37,101 kbp 98.4% 52 [?] 50 80.3% 48 [?]
PersonalExome 29,056 kbp 99.2% 53 81.3% 48 94.2% 48 [?]
PersonalVariants [?] 172 kbp 99.8% 49 83.0% 48 96.0% 47 [?]
PersonalNetContent 29,085 kbp 99.3% 53 80.0% [?] 93.8% 48
Content.[?] 2,389 kbp 99.3% 53 74.0% 47 89.0% 18
Content.[?] 1,639 kbp 99.0% 53 [?] 48 85.6% 17
Content.[?] 903 kbp 71.0% 46 54.0% 38 83.1% 38
Content.[?] 2,086 kbp 89.1% 43 71.0% 38 77.5% 37
Content.[?] 138 kbp 87.2% 44 57.7% 41 77.3% 37
Content.[?] 453 kbp [?] 42 [?] 34 33.2% 33

SNPs [?]
Genome 2,861 Mbp 95.3% 0.34% 1.9% 1.85% 3.0% 1.20% Coverage % of Error: %
RefSeq [?] 70,467 kbp 95.8% 0.21% 61.9% 1.60% 78.6% 1.46% target reported [?]
RefSeq [?] 33,366 kbp 95.0% 0.19% [?] 1.83% 85.4% 1.50% [?]
RefSeqUTR 37,101 kbp 96.3% 0.22% 57.2% 1.80% 73.3% 1.44% [?]
Personal [?] 29,056 kbp 95.6% 0.21% 64.0% 1.50% 83.3% 1.30% [?]
PersonalVariants [?] 172 kbp 99.8% 0.02% 12.8% 0.31% 72.7% 0.10% [?]
PersonalNetContent 29,085 kbp 97.0% 0.14% 44.1% 1.44% 79.3% 0.93%
Content.[?] 2,389 kbp 92.8% 0.10% 60.3% 1.17% 76.5% 0.74%
Content.[?] 1,639 kbp 93.2% 0.11% 24.3% 1.21% 47.9% 0.82%
Content.[?] 903 kbp 85.4% 1.06% 46.3% [?] 56.3% 8.72%
Content.[?] 2,086 kbp 83.6% 0.87% 56.2% 7.00% [?] 7.03%
Content.[?] 138 kbp 83.8% 1.28% 23.1% 13.64% [?] 7.84%
Content.[?] 453 kbp 76.6% 1.29% 39.1% [?] 9.94% [?]

[?] indicates data missing or illegible when filed

TABLE 14

Genomic Region | Library | Reference Loci HQ Cov | Reference Loci Phred Error | Variant Loci HQ Cov | Variant Loci Error % | All Loci HQ Cov | All Loci Phred Error | Region Definition
RefSeq | TruSeq | 79.6% | 49 | 64.0% | 1.55% | 78.6% | 46 | All exons and UTRs
RefSeq | Exome+ | 88.1% | 48 | 76.6% | 1.46% | 87.3% | 46 |
Interpretable Exome | TruSeq | 83.2% | 49 | 65.8% | 1.43% | 82.3% | 47 | 46 PharmGKB VIP genes; 3,502 HGMD genes; 488 Cancer genes; 1,803 MendelDB genes; 2,896 OMIM Mendelian genes; 3,493 HGMD Mendelian genes (90-95% of symbols covered in ESP v1)
Interpretable Exome | Exome+ | 94.2% | 48 | 83.3% | 1.30% | 93.6% | 46 |
Interpretable Variants | TruSeq | 85.2% | 46 | 13.3% | 0.24% | 78.6% | 43 | MendelDB SNP; PharmGKB SNP; Medical dbSNP Regulome1 suspect; GeneReview SNP; HGMD Clinvar SNP; Clinical Channel SNP; OMIM SNP; Varimed Multiethnic SNP; Varimed High Confidence SNP
Interpretable Variants | Exome+ | 96.0% | 47 | 72.7% | 0.10% | 93.9% | 41 |

Example 11 Detection of Genome-Wide Copy Number Variations with Single Reads Whole Genome Sequencing Data

A nucleic acid sample is obtained from a subject and further analyzed by whole genome sequencing to identify the copy number variations in the genome. Single run whole genome sequencing is conducted on the nucleic acid sample 43 times, resulting in 43 single runs of whole genome sequencing data. The single run whole genome sequencing covers 3 gigabasepairs. Paired 100 bp molecules are used. Each of the single run whole genome sequencing data is analyzed using a Hidden Markov Model to detect copy number variations. Genomic bins of 1 kbp, 5 kbp, 10 kbp, and 20 kbp are used for measurement of the number of sequence reads per bin.

Single run whole genome sequencing data with paired 100 bp molecules produces an average of n=250 molecules in a given 50 kb region. This quantity is nominally a Poisson random variate (neglecting non-fundamental noise sources). The SNR of a heterozygous deletion is roughly 10 (approximately sqrt(n/2)) in this example. As shown in FIG. 3, duplications and other copy number increases are found to have a higher SNR. As shown in FIG. 4, successful detection of a heterozygous deletion in the single run whole genome sequencing data is achieved.
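The figures quoted above can be reproduced with a short calculation. The following Python sketch is a worked example under stated assumptions: the genome size is taken as 3 gigabasepairs (so that 3 gigabasepairs of sequence corresponds to roughly 1× depth), each paired 100 bp molecule contributes 200 sequenced bases, and the SNR expression is read as a signal of n/2 (the drop in expected count) divided by Poisson noise of sqrt(n/2) at the deleted level. The function names are hypothetical.

```python
import math


def molecules_per_bin(total_bases, genome_size, bases_per_molecule, bin_size):
    """Expected number of sequenced molecules falling in one genomic bin."""
    depth = total_bases / genome_size          # average fold coverage
    return depth * bin_size / bases_per_molecule


def het_deletion_snr(n):
    """SNR of a heterozygous deletion for Poisson counts: (n/2) / sqrt(n/2)."""
    return math.sqrt(n / 2)


# 3 Gbp of data, assumed 3 Gbp genome, paired 100 bp reads (200 bases per molecule).
n = molecules_per_bin(3e9, 3e9, 200, 50_000)
snr = het_deletion_snr(n)
print(f"50 kb region: n = {n:.0f}, SNR = {snr:.1f}")

# The same arithmetic for the bin sizes used in this Example.
for bin_size in (1_000, 5_000, 10_000, 20_000):
    n_bin = molecules_per_bin(3e9, 3e9, 200, bin_size)
    print(f"{bin_size // 1000} kbp bin: n = {n_bin:.0f}, SNR = {het_deletion_snr(n_bin):.2f}")
```

Under these assumptions the 50 kb region gives n = 250 and an SNR of about 11, in line with the "roughly 10" stated above, and the per-bin SNR falls as the bin size shrinks.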

As shown in FIG. 5, the sensitivity to small copy number variations increases as the size of the genomic bins gets smaller. However, the false positive detection of copy number variations decreases as the size of the genomic bins increases.
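The Hidden Markov Model analysis of per-bin read counts can be sketched as follows. This Python example is a toy illustration and is not the model used to produce the results above: the three copy-ratio states, the Poisson emission model, the uniform start distribution and the self-transition probability are all assumptions chosen for clarity, and the function names are hypothetical.

```python
import math

# Assumed hidden states, expressed as copy ratio relative to a diploid baseline:
# heterozygous deletion, normal, single-copy duplication.
STATES = (0.5, 1.0, 1.5)


def poisson_logpmf(k, lam):
    """Log probability of observing count k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi_cnv(counts, baseline, stay=0.99):
    """Most likely copy-ratio path for non-empty per-bin counts (Viterbi).

    baseline is the expected count per bin at normal copy number;
    stay is an assumed probability of remaining in the same state.
    """
    n_states = len(STATES)
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n_states)]
                 for i in range(n_states)]
    score = [math.log(1.0 / n_states) + poisson_logpmf(counts[0], r * baseline)
             for r in STATES]
    backpointers = []
    for k in counts[1:]:
        prev, score, ptr = score, [], []
        for j, ratio in enumerate(STATES):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            score.append(prev[best_i] + log_trans[best_i][j]
                         + poisson_logpmf(k, ratio * baseline))
            ptr.append(best_i)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# Toy data: 250 molecules per bin, with five bins at half depth in the middle.
counts = [250] * 5 + [125] * 5 + [250] * 5
print(viterbi_cnv(counts, baseline=250))
```

With these toy counts the decoded path assigns a copy ratio of 0.5 to the five half-depth bins and 1.0 elsewhere. Smaller bins lower the expected count per bin, which weakens the per-bin evidence and is one way to picture the trade-off between sensitivity and false positives described above.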

Example 12 Detection of Systematic Read-Depth Variation with Single Reads Whole Genome Sequencing Data

Other factors can lead to systematic read-depth variation, besides the presence of copy number variations. Many of these factors tend to be systematic across samples.

Single run whole genome sequencing data is obtained for a plurality of samples and analyzed using a Hidden Markov Model. The signal read-depth of each read is averaged across 5 or more samples. Many of the factors can be systematic across samples. As shown in FIG. 6, by accumulating a database of the average read depth across many samples, as well as the variance across samples, these systematic regions are identified and filtered out; they would otherwise correspond to false positive CNV detections in a naive application of the Hidden Markov Model method.
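The cross-sample filtering described above can be sketched as a per-bin mean and variance computed over depth-normalized samples. The following Python example is a minimal illustration: the normalization by each sample's mean depth, the two thresholds and the function names are assumptions made for the sketch, and the toy depths are invented for demonstration only.

```python
from statistics import mean, pvariance


def normalize(sample):
    """Scale one sample's per-bin depths so that its mean depth is 1.0."""
    m = mean(sample)
    return [depth / m for depth in sample]


def systematic_bins(samples, max_dev=0.3, max_var=0.02):
    """Indices of bins whose depth deviates consistently across samples.

    A bin is flagged when its mean normalized depth differs from 1.0 by more
    than max_dev while its across-sample variance stays below max_var.
    Both thresholds are hypothetical.
    """
    norm = [normalize(s) for s in samples]
    flagged = []
    for b in range(len(norm[0])):
        column = [s[b] for s in norm]
        if abs(mean(column) - 1.0) > max_dev and pvariance(column) < max_var:
            flagged.append(b)
    return flagged


# Toy depths for 5 samples x 6 bins. Bin 2 is low in every sample (systematic);
# bin 4 is low only in the third sample (a candidate sample-specific CNV).
samples = [
    [100, 102, 50, 98, 100, 100],
    [99, 101, 52, 100, 100, 98],
    [101, 100, 49, 99, 50, 101],
    [100, 98, 51, 101, 100, 100],
    [100, 100, 50, 100, 99, 101],
]
flagged = systematic_bins(samples)
print("systematic bins to filter:", flagged)
```

In this toy data only bin 2 is flagged for filtering, while the sample-specific drop in bin 4 is left for CNV calling.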

Example 13 Whole Exome Sequencing Supplemented with Single Reads Whole Genome Sequencing Data

Whole exome sequencing is conducted and analyzed using methods described in earlier Examples. Low coverage whole genome sequencing is used as a supplement to targeted exome sequencing. The low coverage genome data covers 1-10 gigabasepairs and is analyzed for coverage in genomic bins of 100-1,000,000 basepairs to assess CNV of the sequence in the sample. A single run low coverage whole genome sequencing data covering 3 gigabasepairs adds $100-$200 to the exome sequencing costs and delivers genome-wide SV sensitivity from <50 kb upwards. In addition, variants detected in the low coverage whole genome data can be used to identify known haplotype blocks and impute variants over the whole genome with or without exome data.
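The coverage analysis in genomic bins described above starts from a per-bin read count. The following Python sketch shows one simple way to obtain such counts from read start positions on a single contig. The function name, the 0-based coordinate convention and the toy positions are assumptions for illustration; the bin size would be chosen within the 100-1,000,000 basepair range given above.

```python
def bin_counts(read_starts, bin_size, contig_length):
    """Count reads per fixed-size bin from 0-based read start positions."""
    n_bins = -(-contig_length // bin_size)      # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


# Toy example: five reads on a 3,000 bp contig, counted in 1,000 bp bins.
toy_counts = bin_counts([0, 10, 999, 1000, 2500], bin_size=1000, contig_length=3000)
print(toy_counts)
```

The resulting count vector is the kind of input that a read-depth model, such as the Hidden Markov Model of the earlier Examples, operates on.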

Example 14 Detection of Genome-Wide Copy Number Variations with Off-Target Whole Exome Sequencing Data

Whole exome sequencing is conducted, enriched and analyzed using methods described in earlier Examples. The exonic and non-exonic sequence data of the whole exome sequencing data is annotated. It is observed that about 10% or more of the whole exome sequencing data comprises non-exonic data or off-target reads. Furthermore, it is observed that these non-exonic data are spread fairly uniformly across the entire genome.

Non-exonic data is analyzed using a Hidden Markov Model. Certain regions that are affected by enrichment are excluded from the analysis. Copy number variations and structural variants are detected using the remaining set of reads distributed over the entire genome.
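The exclusion of regions affected by enrichment can be sketched as a filter that keeps only reads falling outside padded target intervals. The following Python example is a minimal illustration on a single contig: the padding value, the half-open coordinate convention, the toy intervals and the function names are all assumptions, and the text does not specify how the excluded regions are actually defined.

```python
from bisect import bisect_right


def padded_targets(targets, pad):
    """Pad each (start, end) target by `pad` bases and merge overlaps."""
    merged = []
    for start, end in sorted((max(s - pad, 0), e + pad) for s, e in targets):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def off_target_reads(read_starts, targets, pad=500):
    """Return read start positions lying outside every padded target interval.

    Intervals are 0-based and half-open; pad=500 is a hypothetical default.
    """
    merged = padded_targets(targets, pad)
    starts = [s for s, _ in merged]
    kept = []
    for pos in read_starts:
        i = bisect_right(starts, pos) - 1       # last interval starting at or before pos
        if i >= 0 and pos < merged[i][1]:
            continue                            # inside an enriched region: drop
        kept.append(pos)
    return kept


# Toy example: two exon targets padded by 100 bp on each side.
targets = [(1000, 1200), (1500, 1700)]
kept = off_target_reads([100, 950, 1350, 1750, 5000], targets, pad=100)
print(kept)
```

The retained positions can then be counted per bin and passed to a read-depth model in the same way as whole genome data.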

Example 15 Detection of Genome-Wide Copy Number Variations with Off-Target Whole Exome Sequencing Data and a Low Coverage Whole Genome Sequencing Data

Whole exome sequencing is conducted using a nucleic acid sample. Data is enriched and analyzed using methods described in earlier Examples. A 0.5 read of whole genome sequencing is conducted on the same nucleic acid sample. The exonic and non-exonic sequence data of the whole exome sequencing data is annotated. Non-exonic data and the 0.5 read whole genome sequencing data are combined and analyzed using a Hidden Markov Model. Certain regions that are affected by enrichment are excluded from the analysis. Copy number variations and structural variants are detected over the entire genome.
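One simple way to picture the combination of the two data sources is to pool their per-bin counts and compare the pooled count with the pooled expectation at normal copy number. The following Python sketch illustrates that idea only. The pooling rule, the function name and the toy counts are assumptions, since the text states that the data are combined but does not specify how.

```python
def pooled_copy_ratio(off_target, wgs, off_target_expected, wgs_expected):
    """Per-bin copy ratio from pooled off-target exome and low coverage WGS counts.

    The *_expected lists hold the counts expected at normal copy number.
    Bins with no expected reads are reported as None.
    """
    ratios = []
    for a, b, ea, eb in zip(off_target, wgs, off_target_expected, wgs_expected):
        expected = ea + eb
        ratios.append((a + b) / expected if expected > 0 else None)
    return ratios


# Toy example with four bins; the last bin is uncovered in both data sets.
ratios = pooled_copy_ratio(
    off_target=[10, 5, 10, 0],
    wgs=[20, 10, 22, 0],
    off_target_expected=[10, 10, 10, 0],
    wgs_expected=[20, 20, 20, 0],
)
print(ratios)
```

Pooling increases the number of molecules per bin, which, for Poisson counts, raises the SNR relative to either data source alone.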

As used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a cell” includes a plurality of such cells and reference to “the peptide” includes reference to one or more peptides and equivalents thereof, e.g., polypeptides, known to those skilled in the art, and so forth.

Methods and systems of the present disclosure can be combined with and/or modified by other methods and systems, such as those described in U.S. Patent Publication No. 2014/0200147 (“Methods and Systems for Genetic Analysis”), which is entirely incorporated herein by reference.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

1.-28. (canceled)
29. A computer-implemented method, comprising: (a) receiving, by a computing system, a set of data resulting from whole genome sequencing performed on a free DNA nucleic acid sample of a subject, wherein the free DNA nucleic acid sample is isolated from one or more of plasma or serum; (b) providing, by the computing system, as input to one or more models, one or more features of the set of data; (c) generating, by the computing system using the one or more models, output detecting one or more genomic regions within the set of data, wherein the one or more detected genomic regions comprise polymorphisms; and (d) generating, by the computing system, using an evaluation of the detected genomic regions, output indicating a likelihood of cancer tumor.
30. The computer-implemented method of claim 29, wherein the set of data further results from nucleic acid amplification prior to the whole genome sequencing.
31. The computer-implemented method of claim 29, further comprising: selecting, by the computing system, the one or more features.
32. The computer-implemented method of claim 31, wherein the selecting the one or more features comprises the computing system using one or more of filter techniques, wrapper methods, embedded techniques, Benjamini-Hochberg procedures, Analysis of Variance (ANOVA), Wilcoxon approaches, or Threshold Number of Misclassification (TNoM).
33. The computer-implemented method of claim 29, wherein the one or more models comprise one or more of Hidden Markov Models (HMMs), random forest models, or Support Vector Machine (SVM) models.
34. The computer-implemented method of claim 29, wherein the generation of the output indicating the likelihood of cancer tumor comprises the computing system using one or more of HMMs, ANOVA, or Principal Component Analysis (PCA).
35. The computer-implemented method of claim 29, wherein said detected genomic regions comprise one or more of base changes, insertions, deletions, repeats, transversions, or copy number variants (CNVs).
36. A system, comprising: at least one processor; and a memory storing instructions that, when executed by the at least one processor, cause the system to perform: (a) receiving a set of data resulting from whole genome sequencing performed on a free DNA nucleic acid sample of a subject, wherein the free DNA nucleic acid sample is isolated from one or more of plasma or serum; (b) providing, as input to one or more models, one or more features of the set of data; (c) generating, using the one or more models, output detecting one or more genomic regions within the set of data, wherein the one or more detected genomic regions comprise polymorphisms; and (d) generating, using an evaluation of the detected genomic regions, output indicating a likelihood of cancer tumor.